Abstract

Learning a Gaussian mixture model (GMM) is a fundamental problem in machine learning, learning 
theory, and statistics. One notion of learning a GMM is proper learning: here, the goal is to find a 
mixture of k Gaussians $\mathcal{M}'$ that is close to the density $f$ of the unknown distribution from which we draw
samples. The distance between $\mathcal{M}'$ and $f$ is typically measured in the total variation or $L_1$-norm.

We give an algorithm for learning a mixture of k univariate Gaussians that is nearly optimal for any 
fixed k. The sample complexity of our algorithm is $\widetilde{O}(k/\epsilon^2)$ and the running time is $(k \cdot \log\frac{1}{\epsilon})^{O(k^4)} + \widetilde{O}(k/\epsilon^2)$.

It is well-known that this sample complexity is optimal (up to logarithmic factors), and it was already 
achieved by prior work. However, the best known time complexity for properly learning a k-GMM was
$\widetilde{O}(1/\epsilon^{3k-1})$. In particular, the dependence between $1/\epsilon$ and k was exponential. We significantly improve
this dependence by replacing the $1/\epsilon$ term with a $\log\frac{1}{\epsilon}$ while only increasing the exponent moderately.
Hence, for any fixed k, the $\widetilde{O}(k/\epsilon^2)$ term dominates our running time, and thus our algorithm runs in time
which is nearly-linear in the number of samples drawn. Achieving a running time of $\mathrm{poly}(k, 1/\epsilon)$ for proper
learning of k-GMMs has recently been stated as an open problem by multiple researchers, and we make
progress on this open problem. 

Moreover, our approach offers an agnostic learning guarantee: our algorithm returns a good GMM 
even if the distribution we are sampling from is not a mixture of Gaussians. To the best of our knowledge, 
our algorithm is the first agnostic proper learning algorithm for GMMs. Again, the closely related 
question of agnostic and proper learning for GMMs in the high-dimensional setting has recently been 
raised as an open question, and our algorithm resolves this question in the univariate setting. 

We achieve these results by approaching the proper learning problem from a new direction: we start 
with an accurate density estimate and then fit a mixture of Gaussians to this density estimate. Hence, 
after the initial density estimation step, our algorithm solves an entirely deterministic optimization problem.
We reduce this optimization problem to a sequence of carefully constructed systems of polynomial
inequalities, which we then solve with Renegar’s algorithm. Our techniques for encoding proper learning 
problems as systems of polynomial inequalities are general and can be applied to properly learn further 
classes of distributions besides GMMs. 


*Supported by NSF grant CCF-1217921 and DOE grant DE-SC0008923.

†Supported by MADALGO and a grant from the MIT-Shell Energy Initiative.
1 Introduction


Gaussian mixture models (GMMs) are one of the most popular and important statistical models, both 
in theory and in practice. A sample is drawn from a GMM by first selecting one of its k components 
according to the mixing weights, and then drawing a sample from the corresponding Gaussian, each of which 
has its own set of parameters. Since many phenomena encountered in practice give rise to approximately 
normal distributions, GMMs are often employed to model distributions composed of several distinct
subpopulations. GMMs have been studied in statistics since the seminal work of Pearson [Pea94] and are now
used in many fields including astronomy, biology, and machine learning. Hence the following are natural and 
important questions: (i) how can we efficiently “learn” a GMM when we only have access to samples from 
the distribution, and (ii) what rigorous guarantees can we give for our algorithms? 
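
The two-stage sampling process described above can be sketched in a few lines. This is only an illustration: the function name `sample_gmm`, the standard-deviation parametrization, and the example parameters are assumptions made here, not part of the paper.

```python
import random

def sample_gmm(weights, means, stddevs, n, seed=0):
    """Draw n samples from a univariate GMM via the two-stage process."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        # Stage 1: select a component according to the mixing weights.
        i = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        # Stage 2: draw from the Gaussian belonging to that component.
        samples.append(rng.gauss(means[i], stddevs[i]))
    return samples

# Hypothetical 2-component mixture; its true mean is 0.3 * (-2) + 0.7 * 3 = 1.5.
samples = sample_gmm([0.3, 0.7], [-2.0, 3.0], [0.5, 1.0], n=20000)
empirical_mean = sum(samples) / len(samples)
```

The learning problems discussed below start from exactly this kind of sample list, without access to the component labels chosen in stage 1.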

1.1 Notions of learning 

There are several natural notions of learning a GMM, all of which have been studied in the learning theory 
community over the last 20 years. The known sample and time complexity bounds differ widely for these 
related problems, and the corresponding algorithmic techniques are also considerably different (see Table 1 
for an overview and a comparison with our results). In order of decreasing hardness, these notions of learning 
are: 

Parameter learning The goal in parameter learning is to recover the parameters of the unknown GMM
(i.e., the means, variances, and mixing weights) up to some given additive error $\epsilon$.¹

Proper learning In proper learning, our goal is to find a GMM $\mathcal{M}'$ such that the $L_1$-distance (or equivalently,
the total variation distance) between our hypothesis $\mathcal{M}'$ and the true unknown distribution is
small.

Improper learning / density estimation Density estimation requires us to find any hypothesis $h$ such
that the $L_1$-distance between $h$ and the unknown distribution is small ($h$ does not need to be a GMM).
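
To make the $L_1$ guarantee concrete, the following sketch approximates the $L_1$-distance between two univariate GMM densities by numerical integration; the total variation distance is half of this value. The helper names, the midpoint rule, and the integration window are assumptions chosen for illustration only.

```python
import math

def normal_pdf(x, mean, stddev):
    z = (x - mean) / stddev
    return math.exp(-0.5 * z * z) / (stddev * math.sqrt(2.0 * math.pi))

def gmm_pdf(x, weights, means, stddevs):
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, stddevs))

def l1_distance(f, g, lo, hi, steps=20000):
    """Midpoint-rule approximation of the integral of |f - g| over [lo, hi]."""
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * width
        total += abs(f(x) - g(x))
    return total * width

# Hypothetical "true" density and a hypothesis whose second mean is slightly off.
truth = lambda x: gmm_pdf(x, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
hypothesis = lambda x: gmm_pdf(x, [0.5, 0.5], [-1.0, 1.2], [1.0, 1.0])
dist = l1_distance(truth, hypothesis, -12.0, 12.0)
tv = dist / 2.0
```

A proper learner must output parameters like those of `hypothesis`, whereas a density estimator may return any function `h` for which the same distance is small.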

Parameter learning is arguably the most desirable guarantee because it allows us to recover the unknown 
mixture parameters. For instance, this is important when the parameters directly correspond to physical 
quantities that we wish to infer. However, this power comes at a cost: recent work on parameter learning 
has shown that $\Omega(1/\epsilon^{12})$ samples are already necessary to learn the parameters of a mixture of two univariate
Gaussians with accuracy $\epsilon$ [HP15] (note that this bound is tight, i.e., the paper also gives an algorithm
with time and sample complexity $O(1/\epsilon^{12})$). Moreover, the sample complexity of parameter learning scales
exponentially with the number of components: for a mixture of k univariate Gaussians, the above paper
also gives a sample complexity lower bound of $\Omega(1/\epsilon^{6k-2})$. Hence the sample complexity of parameter learning
quickly becomes prohibitive, even for a mixture of two Gaussians and reasonable choices of $\epsilon$.

At the other end of the spectrum, improper learning has much smaller time and sample complexity. 
Recent work shows that it is possible to estimate the density of a univariate GMM with k components 
using only $\widetilde{O}(k/\epsilon^2)$ samples and time [ADLS15], which is tight up to logarithmic factors. However, the output
hypothesis produced by the corresponding algorithm is only a piecewise polynomial and not a GMM. This 
is a disadvantage because GMMs are often desirable as a concise representation of the samples and for 
interpretability reasons. 

Hence an attractive intermediate goal is proper learning: similar to parameter learning, we still produce a
GMM as output. On the other hand, we must only satisfy the weaker $L_1$-approximation guarantee between
our hypothesis and the unknown GMM. While somewhat weaker than parameter learning, proper learning
still offers many desirable features: for instance, the representation as a GMM requires only $3k - 1$ parameters,
which is significantly smaller than the at least $6k(1 + 2\log\frac{1}{\epsilon})$ many parameters produced by the piecewise
polynomial density estimate of [ADLS15] (note that the number of parameters in the piecewise polynomial
also grows as the accuracy of the density estimate increases). Moreover, the representation as a GMM
allows us to provide simple closed-form expressions for quantities such as the mean and the moments of the
learnt distribution, which are then easy to manipulate and understand. In contrast, no such closed-form
expressions exist when given a general density estimate as returned by [ADLS15]. Furthermore, producing a
GMM as output hypothesis can be seen as a regularization step because the number of peaks in the density
is bounded by k, and the density is guaranteed to be smooth. This is usually an advantage over improper
hypotheses such as piecewise polynomials that can have many more peaks or discontinuities, which makes
the hypothesis harder to interpret and process. Finally, even the most general parameter learning algorithms
require assumptions on the GMMs such as identifiability, while our proper learning algorithm works for any
GMM.

¹Since the accuracy $\epsilon$ depends on the scale of the mixture (i.e., the variance), these guarantees are often specified relative to
the variance.

Problem type          Sample complexity                   Sample complexity                         Time complexity                                                          Agnostic
                      lower bound                         upper bound                               upper bound                                                              guarantee

Parameter learning
  k = 2               $\Theta(1/\epsilon^{12})$ [HP15]    $O(1/\epsilon^{12})$ [HP15]               $O(1/\epsilon^{12})$ [HP15]                                              no
  general k           $\Omega(1/\epsilon^{6k-2})$ [HP15]  $O((1/\epsilon)^{c_k})$ [MV10]            $O((1/\epsilon)^{c_k})$ [MV10]                                           no

Proper learning
  k = 2               $\Theta(1/\epsilon^2)$              $\widetilde{O}(1/\epsilon^2)$ [DK14]      $\widetilde{O}(1/\epsilon^5)$ [DK14]                                     no
  general k           $\Theta(k/\epsilon^2)$              $\widetilde{O}(k/\epsilon^2)$ [AJOS14]    $\widetilde{O}(1/\epsilon^{3k-1})$ [DK14, AJOS14]                        no

Our results
  k = 2               -                                   $\widetilde{O}(1/\epsilon^2)$             $\widetilde{O}(1/\epsilon^2)$                                            yes
  general k           -                                   $\widetilde{O}(k/\epsilon^2)$             $(k \log\frac{1}{\epsilon})^{O(k^4)} + \widetilde{O}(k/\epsilon^2)$      yes

Density estimation
  general k           $\Theta(k/\epsilon^2)$              $\widetilde{O}(k/\epsilon^2)$ [ADLS15]    $\widetilde{O}(k/\epsilon^2)$ [ADLS15]                                   yes

Table 1: Overview of the best known results for learning a mixture of univariate Gaussians. Our contributions
(the rows labeled “Our results”) significantly improve on the previous results for proper learning: the time complexity of
our algorithm is nearly optimal for any fixed k. Moreover, our algorithm gives agnostic learning guarantees.
The constant $c_k$ in the time and sample complexity of [MV10] depends only on k and is at least k. The
sample complexity lower bounds for proper learning and density estimation are folklore results. The only
time complexity lower bounds known are the corresponding sample complexity lower bounds, so we omit an
extra column for time complexity lower bounds.

Ideally, proper learning could combine the interpretability and conciseness of parameter learning with 
the small time and sample complexity of density estimation. Indeed, recent work has shown that one can 
properly learn a mixture of k univariate Gaussians from only $\widetilde{O}(k/\epsilon^2)$ samples [DK14, AJOS14], which is tight
up to logarithmic factors. However, the time complexity of proper learning is not yet well understood. This
is in contrast to parameter learning and density estimation, where we have strong lower bounds and
nearly-optimal algorithms, respectively. For the case of two mixture components, the algorithm of [DK14] runs
in time $\widetilde{O}(1/\epsilon^5)$. However, the time complexity of this approach becomes very large for general k and scales
as $\widetilde{O}(1/\epsilon^{3k-1})$ [DK14, AJOS14]. Note that this time complexity is much larger than the $\widetilde{O}(k/\epsilon^2)$ required for
density estimation and resembles the exponential dependence between $1/\epsilon$ and k in the $\Omega(1/\epsilon^{6k-2})$ lower bound
for parameter learning. Hence the true time complexity of properly learning a GMM is an important open
question. In particular, it is not known whether the exponential dependence between $1/\epsilon$ and k is necessary.

In our work, we answer this question and show that such an exponential dependence between $1/\epsilon$ and k
can be avoided. We give an algorithm with the same (nearly optimal) sample complexity as previous work,
but our algorithm also runs in time which is nearly-optimal, i.e., nearly-linear in the number of samples, for
any fixed k. It is worth noting that proper learning of k-GMMs in time $\mathrm{poly}(k, 1/\epsilon)$ has been raised as an open
problem [Moi14, BDKVDL15], and we make progress on this question. Moreover, our learning algorithm is
agnostic, which means that the algorithm tolerates arbitrary amounts of worst-case noise in the distribution 
generating our samples. This is an important robustness guarantee, which we now explain further. 

1.2 Robust learning guarantees 

All known algorithms for properly learning or parameter learning² a GMM offer rigorous guarantees only
when there is at most a very small amount of noise in the distribution generating our samples. Typically,
these algorithms work in a “non-agnostic” setting: they assume that samples come from a distribution that
is exactly a GMM with probability density function (pdf) $f$, and produce a hypothesis pdf $h$ such that for
some given $\epsilon > 0$

$$\|f - h\|_1 \le \epsilon .$$

But in practice, we cannot typically expect that we draw samples from a distribution that truly corresponds 
to a GMM. While many natural phenomena are well approximated by a GMM, such an approximation is 
rarely exact. Instead, it is useful to think of such phenomena as GMMs corrupted by some amount of noise. 
Hence it is important to design algorithms that still provide guarantees when the true unknown distribution 
is far from any mixture of k Gaussians. 

Agnostic learning Therefore, we focus on the problem of agnostic learning [KSS94], where our samples
can come from any distribution, not necessarily a GMM. Let $f$ be the pdf of this unknown distribution
and let $\mathcal{M}_k$ be the set of pdfs corresponding to a mixture of k Gaussians. Then we define $\mathrm{OPT}_k$ to be the
following quantity:

$$\mathrm{OPT}_k = \min_{h \in \mathcal{M}_k} \|f - h\|_1 ,$$

which is the error achieved by the best approximation of $f$ with a k-GMM. Note that this is a deterministic
quantity, which can also be seen as the error incurred when projecting $f$ onto the set $\mathcal{M}_k$. Using this definition,
an agnostic learning algorithm produces a GMM with density $h$ such that

$$\|f - h\|_1 \le C \cdot \mathrm{OPT}_k + \epsilon$$

for some given $\epsilon > 0$ and a universal constant³ $C$ that does not depend on $\epsilon$, k, or the unknown pdf $f$.

Clearly, agnostic learning guarantees are more desirable because they also apply when the distribution
producing our samples does not match our model exactly (note also that agnostic learning is strictly more
general than non-agnostic learning). Moreover, the agnostic learning guarantee is “stable”: when our model is
close to the true distribution $f$, the error of the best approximation, i.e., $\mathrm{OPT}_k$, is small. Hence an agnostic
learning algorithm still produces a good hypothesis. 

On the other hand, agnostic learning algorithms are harder to design because we cannot make any 
assumptions on the distribution producing our samples. To the best of our knowledge, our algorithm is 
the first agnostic algorithm for properly learning a mixture of k univariate Gaussians. Note that agnostic 
learning has recently been raised as an open question in the setting of learning high-dimensional GMMs 
[Vem12], and our agnostic univariate algorithm can be seen as progress on this problem. Moreover, our
algorithm achieves such an agnostic guarantee without any increase in time or sample complexity compared 
to the non-agnostic case. 

²Note that parameter learning is not well-defined if the samples do not come from a GMM. Instead, existing parameter
learning algorithms are not robust in the following sense: if the unknown distribution is not a GMM, the algorithms are not
guaranteed to produce a set of parameters such that the corresponding GMM is close to the unknown density.

³Strictly speaking, agnostic learning requires this constant $C$ to be 1. However, such a tight guarantee is impossible for
some learning problems such as density estimation. Hence we allow any constant $C$ in an agnostic learning guarantee, which is
sometimes also called semi-agnostic learning.
1.3 Our contributions


We now outline the main contributions of this paper. Similar to related work on proper learning [DK14], we 
restrict our attention to univariate GMMs. Many algorithms for high-dimensional GMMs work via reductions 
to the univariate case, so it is important to understand this case in greater detail [KMV10, MV10, HP15].

First, we state our main theorem. See Section 2 for a formal definition of the notation. The quantity 
$\mathrm{OPT}_k$ is the same as introduced above in Section 1.2.

Theorem 1. Let $f$ be the pdf of an arbitrary unknown distribution, let k be a positive integer, and let $\epsilon > 0$.
Then there is an algorithm that draws $\widetilde{O}(k/\epsilon^2)$ samples from the unknown distribution and produces a mixture
of k Gaussians such that the corresponding pdf $h$ satisfies

$$\|f - h\|_1 \le 42 \cdot \mathrm{OPT}_k + \epsilon .$$

Moreover, the algorithm runs in time

$$\left(k \cdot \log\frac{1}{\epsilon}\right)^{O(k^4)} + \widetilde{O}(k/\epsilon^2) .$$

We remark that we neither optimized the exponent $O(k^4)$, nor the constant in front of $\mathrm{OPT}_k$. Instead,
we see our result as a proof of concept that it is possible to agnostically and properly learn a mixture of 
Gaussians in time that is essentially fixed-parameter optimal. As mentioned above, closely related questions 
about efficient and agnostic learning of GMMs have recently been posed as open problems, and we make 
progress on these questions. In particular, our main theorem implies the following contributions: 

Running time The time complexity of our algorithm is significantly better than previous work on proper 
learning of GMMs. For the special case of 2 mixture components studied in [DK14] and [HP15], our running 
time simplifies to $\widetilde{O}(1/\epsilon^2)$. This is a significant improvement over the $\widetilde{O}(1/\epsilon^5)$ bound in [DK14]. Moreover,
our time complexity matches the best possible time complexity for density estimation of 2-GMMs up to 
logarithmic factors. This also implies that our time complexity is optimal up to log-factors. 

For proper learning of k mixture components, prior work achieved a time complexity of $\widetilde{O}(1/\epsilon^{3k-1})$ [DK14,
AJOS14]. Compared to this result, our algorithm achieves an exponential improvement in the dependence
between $1/\epsilon$ and k: our running time contains only a $(\log\frac{1}{\epsilon})$ term raised to the poly(k)-th power, not a $(1/\epsilon)^k$.
In particular, the $\widetilde{O}(k/\epsilon^2)$ term in our running time dominates for any fixed k. Hence the time complexity of
our algorithm is nearly optimal for any fixed k.
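
As a rough numerical illustration of the gap between the prior $1/\epsilon^{3k-1}$ bound and the $k/\epsilon^2$ sample term, the following sketch evaluates both expressions for $1/\epsilon = 100$. Constants and logarithmic factors hidden by the asymptotic notation are deliberately ignored, and the function names are illustrative assumptions, so the numbers only convey orders of magnitude.

```python
def prior_time_term(k, inv_eps):
    """(1/eps)^(3k-1): prior proper-learning time bound, constants and logs dropped."""
    return inv_eps ** (3 * k - 1)

def sample_term(k, inv_eps):
    """k / eps^2: the near-linear sample term, constants and logs dropped."""
    return k * inv_eps ** 2

for k in (2, 3, 4):
    # Integer arithmetic keeps the comparison exact for inv_eps = 100.
    print(k, prior_time_term(k, 100), sample_term(k, 100))
```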

Agnostic learning Our algorithm is the first proper learning algorithm for GMMs that is agnostic. Previous
algorithms relied on specific properties of the normal distribution such as moments, while our techniques
are more robust. Practical algorithms should offer agnostic guarantees, and we hope that our approach is 
a step in this direction. Moreover, it is worth noting that agnostic learning, i.e., learning under noise, is 
often significantly harder than non-agnostic learning. One such example is learning parity with noise, which 
is conjectured to be computationally hard. Hence it is an important question to understand which learning 
problems are tractable in the agnostic setting. While the agnostic guarantee achieved by our algorithm is 
certainly not optimal, our algorithm still shows that it is possible to learn a mixture of Gaussians agnostically 
with only a very mild dependence on $1/\epsilon$.

From improper to proper learning Our techniques offer a general scheme for converting improper 
learning algorithms to proper algorithms. In particular, our approach applies to any parametric family 
of distributions that are well approximated by a piecewise polynomial in which the parameters appear 
polynomially and the breakpoints depend polynomially (or rationally) on the parameters. As a result, 
we can convert purely approximation-theoretic results into proper learning algorithms for other classes of 
distributions, such as mixtures of Laplace or exponential distributions. Conceptually, we show how to
approach proper learning as a purely deterministic optimization problem once a good density estimate is
available. Hence our approach differs from essentially all previous proper learning algorithms, which use 
probabilistic arguments in order to learn a mixture of Gaussians. 

1.4 Techniques 

At its core, our algorithm fits a mixture of Gaussians to a density estimate. In order to obtain an $\epsilon$-accurate
and agnostic density estimate, we invoke recent work that has a time and sample complexity of $\widetilde{O}(k/\epsilon^2)$
[ADLS15]. The density estimate produced by their algorithm has the form of a piecewise polynomial with
$O(k)$ pieces, each of which has degree $O(\log\frac{1}{\epsilon})$. It is important to note that our algorithm does not draw
any further samples after obtaining this density estimate: the process of fitting a mixture of Gaussians is
entirely deterministic.

Once we have obtained a good density estimate, the task of proper learning reduces to fitting a mixture of 
k Gaussians to the density estimate. We achieve this via a further reduction from fitting a GMM to solving 
a carefully designed system of polynomial inequalities. We then solve the resulting system with Renegar’s 
algorithm [Ren92a, Ren92b]. This reduction to a system of polynomial inequalities is our main technical 
contribution and relies on the following techniques. 

Shape-restricted polynomials Ideally, one could directly fit a mixture of Gaussian pdfs to the density
estimate. However, this is a challenging task because the Gaussian pdf $\frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ is not convex in the
parameters $\mu$ and $\sigma$. Thus fitting a mixture of Gaussians is a non-convex problem.

Instead of fitting mixtures of Gaussians directly, we use the notion of a shape-restricted polynomial.
We say that a polynomial is shape-restricted if its coefficients are in a given semialgebraic set, i.e., a set
defined by a finite number of polynomial equalities and inequalities. It is well-known in approximation theory
that a single Gaussian can be approximated by a piecewise polynomial consisting of three pieces with degree
at most $O(\log\frac{1}{\epsilon})$ [Tim63]. So instead of fitting a mixture of k Gaussians directly, we fit a mixture of
k shape-restricted piecewise polynomials. By encoding that the shape-restricted polynomials must have the
shape of Gaussian pdfs, we ensure that the mixture of shape-restricted piecewise polynomials found by the
system of polynomial inequalities is close to a true mixture of k Gaussians. After we have solved the system
of polynomial inequalities, it is easy to convert the shape-restricted polynomials back to a proper GMM.
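
The three-piece idea can be illustrated numerically. The sketch below approximates the standard Gaussian pdf by zero on both tails and by a truncated Taylor series of $e^{-x^2/2}$ on the middle interval, and then measures the $L_1$ error. The Taylor truncation is a stand-in chosen here for simplicity, not the construction from [Tim63], and the cutoff and number of terms are illustrative assumptions.

```python
import math

def gaussian_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def three_piece_approx(x, cutoff, terms):
    """Zero on both tails; truncated Taylor series of the pdf on [-cutoff, cutoff]."""
    if abs(x) > cutoff:
        return 0.0
    t = -0.5 * x * x
    total, term = 0.0, 1.0
    for j in range(terms):
        total += term          # adds t^j / j!
        term *= t / (j + 1)
    return total / math.sqrt(2.0 * math.pi)

def l1_error(cutoff, terms, steps=20000):
    """L1 error: midpoint rule on the middle piece plus the exact Gaussian tail mass."""
    width = 2.0 * cutoff / steps
    middle = 0.0
    for i in range(steps):
        x = -cutoff + (i + 0.5) * width
        middle += abs(gaussian_pdf(x) - three_piece_approx(x, cutoff, terms))
    tails = math.erfc(cutoff / math.sqrt(2.0))  # mass outside [-cutoff, cutoff]
    return middle * width + tails

err = l1_error(cutoff=4.0, terms=41)  # middle piece is a polynomial of degree 80 in x
```

In this sketch the error is dominated by the discarded tail mass, which is why the cutoff only needs to grow like $\sqrt{\log(1/\epsilon)}$ as the target accuracy shrinks.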

$\mathcal{A}_K$-distance The system of polynomial inequalities we use for finding a good mixture of piecewise polynomials
must encode that the mixture should be close to the density estimate. In our final guarantee for
proper learning, we are interested in an approximation guarantee in the $L_1$-norm. However, directly encoding
the $L_1$-norm in the system of polynomial inequalities is challenging because it requires knowledge of the
intersections between the density estimate and the mixture of piecewise polynomials in order to compute the
integral of their difference accurately. Instead of directly minimizing the $L_1$-norm, we minimize the
closely related $\mathcal{A}_K$-norm from VC (Vapnik-Chervonenkis) theory [DL01]. For functions with at most $K - 1$
sign changes, the $\mathcal{A}_K$-norm exactly matches the $L_1$-norm. Since two mixtures of k Gaussians have at most
$O(k)$ intersections, we have a good bound on the order of the $\mathcal{A}_K$-norm we use to replace the $L_1$-norm. In
contrast to the $L_1$-norm, we can encode the $\mathcal{A}_K$-norm without increasing the size of our system of polynomial
inequalities significantly; directly using the $L_1$-norm would lead to an exponential dependence on $\log\frac{1}{\epsilon}$ in
our system of polynomial inequalities.
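
The relation between the two norms can be checked on a discretized example. The sketch below assumes the usual definition from VC theory, which is not spelled out in this section: the $\mathcal{A}_K$-norm of a function is the supremum, over collections of at most $K$ disjoint intervals, of the sum of the absolute values of its integrals over those intervals. The input `masses` holds the integral of the function over consecutive grid cells, intervals are restricted to unions of adjacent cells, and the dynamic program is an illustrative choice made here.

```python
def ak_norm(masses, K):
    """Max over at most K disjoint runs of consecutive cells of sum |run total|."""
    NEG = float("-inf")
    # closed[j]: best value with j finished runs; pos[j]/neg[j]: the j-th run is
    # still open and is counted with sign + / - respectively.
    closed = [0.0] + [NEG] * K
    pos = [NEG] * (K + 1)
    neg = [NEG] * (K + 1)
    for m in masses:
        new_closed = closed[:]
        new_pos = [NEG] * (K + 1)
        new_neg = [NEG] * (K + 1)
        for j in range(1, K + 1):
            start = closed[j - 1]                  # open a new j-th run at this cell
            new_pos[j] = max(pos[j], start) + m    # or extend the open run
            new_neg[j] = max(neg[j], start) - m
            new_closed[j] = max(closed[j], new_pos[j], new_neg[j])
        closed, pos, neg = new_closed, new_pos, new_neg
    return max(closed)

masses = [1.0, 2.0, -3.0, -1.0, 4.0]   # two sign changes
l1 = sum(abs(m) for m in masses)
```

For this example, `ak_norm(masses, 3)` equals `l1` because three intervals suffice to separate the three sign-constant regions, while smaller values of `K` give strictly smaller results.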

Adaptively rescaling the density estimate In order to use Renegar’s algorithm for solving our system 
of polynomial inequalities, we require a bound on the accuracy necessary to find a good set of mixture 
components. While Renegar’s algorithm has a good dependence on the accuracy parameter, our goal is to 
give an algorithm for proper learning without any assumptions on the GMM. Therefore, we must be able 
to produce good GMMs even if the parameters of the unknown GMM are, e.g., doubly-exponential in $1/\epsilon$ or
even larger. Note that this issue arises in spite of the fact that our algorithm works in the real-RAM model:
since different mixture parameters can have widely varying scales, specifying a single accuracy parameter
for Renegar’s algorithm is not sufficient. 

We overcome this technical challenge by adaptively rescaling the parametrization used in our system of 
polynomial inequalities based on the lengths of the intervals $I_1, \ldots, I_s$ that define the piecewise polynomial
density estimate $p_{\mathrm{dens}}$. Since $p_{\mathrm{dens}}$ can only be large on intervals $I_i$ of small length, the best Gaussian fit to
$p_{\mathrm{dens}}$ can only have large parameters near such intervals. Hence, this serves as a simple way of identifying
where we require more accuracy when computing the mixture parameters. 

Putting things together Combining the ideas outlined above, we can fit a mixture of k Gaussians 
with a carefully designed system of polynomial inequalities. A crucial aspect of the system of polynomial 
inequalities is that the number of variables is $O(k)$, that the number of inequalities is $k^{O(k)}$, and the degree of
the polynomials is bounded by $O(\log\frac{1}{\epsilon})$. These bounds on the size of the system of polynomial inequalities
then lead to the running time stated in Theorem 1. In particular, the size of the system of polynomial
inequalities is almost independent of the number of samples, and hence the running time required to solve
the system scales only poly-logarithmically with $1/\epsilon$.

1.5 Related work 

Due to space constraints, it is impossible to summarize the entire body of work on learning GMMs here. 
Therefore, we limit our attention to results with provable guarantees corresponding to the notions of learning
outlined in Subsection 1.1. Note that this is only one part of the picture: for instance, the well-known
Expectation-Maximization (EM) algorithm is still the subject of current research (see [BWY14] and references
therein).

For parameter learning, the seminal work of Dasgupta [Das99] started a long line of research in the 
theoretical computer science community, e.g., [SK01, VW04, AM05, KSV08, BV08, KMV10]. We refer the
reader to [MV10] for a discussion of these and related results. The papers [MV10] and [BS10] were the first to
give polynomial time algorithms (polynomial in $\epsilon$ and the dimension of the mixture) with provably minimal
assumptions for k-GMMs. More recently, Hardt and Price gave tight bounds for learning the parameters
of a mixture of 2 univariate Gaussians [HP15]: $\Theta(1/\epsilon^{12})$ samples are necessary and sufficient, and the time
complexity is linear in the number of samples. Moreover, Hardt and Price give a strong lower bound of
$\Omega(1/\epsilon^{6k-2})$ for the sample complexity of parameter learning a k-GMM. While our proper learning algorithm
offers a weaker guarantee than these parameter learning approaches, our time complexity does not have an
exponential dependence between $1/\epsilon$ and k. Moreover, proper learning retains many of the attractive features
of parameter learning (see Subsection 1.1). 

Interestingly, parameter learning becomes more tractable as the number of dimensions increases. A recent 
line of work investigates this phenomenon under a variety of assumptions (e.g., non-degeneracy or smoothed 
analysis) [HK13, BCMV14, ABG+14, GHK15]. However, all of these algorithms require a lower bound on
the dimension $d$ such as $d \ge \Omega(k)$ or $d \ge \Omega(k^2)$. Since we focus on the one-dimensional case, our results are
not directly comparable. Moreover, to the best of our knowledge, none of the parameter learning algorithms 
(in any dimension) provide proper learning guarantees in the agnostic setting. 

The first work to consider proper learning of k-GMMs without separation assumptions on the components
was [FSO06]. Their algorithm takes $\mathrm{poly}(d, 1/\epsilon, L)$ samples and returns a mixture whose KL-divergence to
the unknown mixture is at most $\epsilon$. Unfortunately, their algorithm has a pseudo-polynomial dependence on
$L$, which is a bound on the means and the variances of the underlying components. Note that such an
assumption is not necessary a priori, and our algorithm works without any such requirements. Moreover, 
their sample complexity is exponential in the number of components k. 

The work closest to ours are the papers [DK14] and [AJOS14], which also consider the problem of properly
learning a k-GMM. Their algorithms are based on constructing a set of candidate GMMs that are then
compared via an improved version of the Scheffé estimate. While this approach leads to a nearly-optimal
sample complexity of $\widetilde{O}(k/\epsilon^2)$, their algorithm constructs a large number of candidate hypotheses. This leads
to a time complexity of $\widetilde{O}(1/\epsilon^{3k-1})$. As pointed out in Subsection 1.1, our algorithm significantly improves the
dependence between $1/\epsilon$ and k. Moreover, none of their algorithms are agnostic.

Another related paper on learning GMMs is [BSZ15]. Their approach reduces the learning problem to 
finding a sparse solution to a non-negative linear system. Conceptually, this approach is somewhat similar 
to ours in that they also fit a mixture of Gaussians to a set of density estimates. However, their algorithm 
does not give a proper learning guarantee: instead of k mixture components, the GMM returned by their 
algorithm contains $O(k/\epsilon^3)$ components. Note that this number of components is significantly larger than the
k components returned by our algorithm. Moreover, their number of components increases as the accuracy
parameter $\epsilon$ improves. In the univariate case, the time and sample complexity of their algorithm is $O(k/\epsilon^6)$.
Note that their sample complexity is not optimal and roughly $1/\epsilon^4$ worse than our approach. For any fixed
k, our running time is also better by roughly $1/\epsilon^4$. Furthermore, the authors do not give an agnostic learning
guarantee for their algorithm.

For density estimation, there is a recent line of work on improperly learning structured distributions 
[CDSS13, CDSS14, ADLS15]. While the most recent paper from this line achieves a nearly-optimal time 
and sample complexity for density estimation of k-GMMs, the hypothesis produced by their algorithm is a
piecewise polynomial. As mentioned in Subsection 1.1, GMMs have several advantages as output hypotheses.

1.6 Outline of our paper

In Section 2, we introduce basic notation and important known results that we utilize in our algorithm. 
Section 3 describes our learning algorithm for the special case of well-behaved density estimates. This 
assumption allows us to introduce two of our main tools (shape-restricted polynomials and the $\mathcal{A}_K$-distance as
a proxy for $L_1$) without the technical details of adaptively reparametrizing the shape-restricted polynomials.
Section 4 then removes this assumption and gives an algorithm that works for agnostically learning any 
mixture of Gaussians. In Section 5, we show how our techniques can be extended to properly learn further 
classes of distributions. 


2 Preliminaries 

Before we construct our learning algorithm for GMMs, we introduce basic notation and the necessary tools 
from density estimation, systems of polynomial inequalities, and approximation theory. 


2.1 Basic notation and definitions 


For a positive integer k, we write $[k]$ for the set $\{1, \ldots, k\}$. Let $I = [\alpha, \beta]$ be an interval. Then we denote the
length of $I$ with $|I| = \beta - \alpha$. For a measurable function $f : \mathbb{R} \to \mathbb{R}$, the $L_1$-norm of $f$ is $\|f\|_1 = \int |f(x)| \, dx$.
All functions in this paper are measurable.

Since we work with systems of polynomial inequalities, it will be convenient for us to parametrize the 
normal distribution with the precision, i.e., one over the standard deviation, instead of the variance. Thus, 
throughout the paper we let 


$$\mathcal{N}_{\mu,\tau}(x) = \frac{\tau}{\sqrt{2\pi}} \, e^{-\tau^2 (x-\mu)^2 / 2}$$

denote the pdf of a normal distribution with mean $\mu$ and precision $\tau$. A k-GMM is a distribution with pdf
of the form $\sum_{i=1}^{k} w_i \cdot \mathcal{N}_{\mu_i,\tau_i}(x)$, where we call the $w_i$ mixing weights and require that the $w_i$ satisfy $w_i \ge 0$
and $\sum_{i=1}^{k} w_i = 1$. Thus a k-GMM is parametrized by $3k$ parameters; namely, the mixing weights, means,
and precisions of each component.⁴ We let $\Theta_k = S_k \times \mathbb{R}^k \times \mathbb{R}_+^k$ be the set of parameters, where $S_k$ is the
simplex in k dimensions. For each $\theta \in \Theta_k$, we identify it canonically with $\theta = (w, \mu, \tau)$ where $w$, $\mu$, and $\tau$
are each vectors of length k, and we let

$$\mathcal{M}_\theta(x) = \sum_{i=1}^{k} w_i \cdot \mathcal{N}_{\mu_i,\tau_i}(x)$$

be the pdf of the k-GMM with parameters $\theta$.

⁴Note that there are only $3k - 1$ degrees of freedom since the mixing weights must sum to 1.
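
The precision parametrization can be written down directly. The following sketch evaluates $\mathcal{N}_{\mu,\tau}$ and $\mathcal{M}_\theta$ and checks membership in the parameter set; the function names are assumptions made for illustration, and the membership test reads $\mathbb{R}_+$ as strictly positive precisions, which is also an assumption.

```python
import math

def normal_pdf_precision(x, mu, tau):
    """N_{mu,tau}(x), where tau is one over the standard deviation."""
    return tau / math.sqrt(2.0 * math.pi) * math.exp(-0.5 * (tau * (x - mu)) ** 2)

def gmm_pdf(theta, x):
    """M_theta(x) for theta = (w, mu, tau), three sequences of length k."""
    w, mu, tau = theta
    return sum(wi * normal_pdf_precision(x, mi, ti) for wi, mi, ti in zip(w, mu, tau))

def in_parameter_set(theta, tol=1e-9):
    """Weights on the simplex, real means, and (assumed strictly) positive precisions."""
    w, mu, tau = theta
    return (len(w) == len(mu) == len(tau)
            and all(wi >= 0 for wi in w)
            and abs(sum(w) - 1.0) <= tol
            and all(ti > 0 for ti in tau))

# Hypothetical 2-GMM: standard deviations 1 and 1/2 correspond to precisions 1 and 2.
theta = ([0.25, 0.75], [0.0, 2.0], [1.0, 2.0])
```

One reason this parametrization is convenient is that the precision enters the exponent as $\tau^2 (x - \mu)^2$, a polynomial in the parameters, whereas the variance would appear in a denominator.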


2.2 Important tools 

We now turn our attention to results from prior work. 

2.2.1 Density estimation with piecewise polynomials 

Our algorithm uses the following result about density estimation of k-GMMs as a subroutine.

Fact 2 ([ADLS15]). Let $k \ge 1$, $\epsilon > 0$ and $\delta > 0$. There is an algorithm Estimate-Density($k$, $\epsilon$, $\delta$) that
satisfies the following properties: the algorithm

• takes $\widetilde{O}((k + \log(1/\delta))/\epsilon^2)$ samples from the unknown distribution with pdf $f$,

• runs in time $\widetilde{O}((k + \log(1/\delta))/\epsilon^2)$, and

• returns $p_{\mathrm{dens}}$, an $O(k)$-piecewise polynomial of degree $O(\log(1/\epsilon))$ such that

I|/-Pdens||l < l-OPTfc+e 

with probability at least 1 — <5, where 

OPTfc = min 11/-Aiello . 


2.2.2 Systems of polynomial inequalities 

In order to fit a $k$-GMM to the density estimate, we solve a carefully constructed system of polynomial
inequalities. Formally, a system of polynomial inequalities is an expression of the form

$$S = (Q_1 x^{(1)} \in \mathbb{R}^{n_1}) \cdots (Q_v x^{(v)} \in \mathbb{R}^{n_v}) \; P(y, x^{(1)}, \ldots, x^{(v)})$$

where

• the $y = (y_1, \ldots, y_\ell)$ are free variables,

• for all $i \in [v]$, the quantifier $Q_i$ is either $\exists$ or $\forall$,

• $P(y, x^{(1)}, \ldots, x^{(v)})$ is a quantifier-free Boolean formula with $m$ predicates of the form

$$g_i(y, x^{(1)}, \ldots, x^{(v)}) \; \Delta_i \; 0$$

where each $g_i$ is a real polynomial of degree $d$, and where the relations $\Delta_i$ are of the form $\Delta_i \in \{>, \ge, =, \ne, \le, <\}$.
We call such predicates polynomial predicates.

We say that $y \in \mathbb{R}^\ell$ is a $\lambda$-approximate solution for this system of polynomial inequalities if there exists
a $y' \in \mathbb{R}^\ell$ such that $y'$ satisfies the system and $\|y - y'\|_2 \le \lambda$. We use the following result by Renegar as a
black-box:

Fact 3 ([Ren92a, Ren92b]). Let $0 < \lambda < \eta$ and let $S$ be a system of polynomial inequalities as defined above.
Then there is an algorithm Solve-Poly-System($S$, $\lambda$, $\eta$) that finds a $\lambda$-approximate solution if there exists
a solution $y$ with $\|y\|_2 \le \eta$. If no such solution exists, the algorithm returns “NO-SOLUTION”. In any case,
the algorithm runs in time

$$(md)^{2^{O(v)} \ell \prod_k n_k} \, \log\log\left(3 + \frac{\eta}{\lambda}\right) .$$

2.2.3 Shape-restricted polynomials

Instead of fitting Gaussian pdfs to our density estimate directly, we work with piecewise polynomials as a 
proxy. Hence we need a good approximation of the Gaussian pdf with a piecewise polynomial. In order to 
achieve this, we use three pieces: two flat pieces that are constant 0 for the tails of the Gaussian, and a
center piece that is given by the Taylor approximation.

Let $T_d(x)$ be the degree-$d$ Taylor series approximation to $\mathcal{N}$ around zero. It is straightforward to
show:

Lemma 4. Let $\varepsilon, K > 0$ and let $T_d(x)$ denote the degree-$d$ Taylor expansion of the Gaussian pdf $\mathcal{N}$ around
0. For $d = 2K \log(1/\varepsilon)$, we have

$$\int_{-2\sqrt{\log(1/\varepsilon)}}^{2\sqrt{\log(1/\varepsilon)}} |\mathcal{N}(x) - T_d(x)|\,dx \le O\left(\varepsilon^K \sqrt{\log(1/\varepsilon)}\right) .$$

Definition 5 (Shape-restricted polynomials). Let $K$ be such that

$$\int_{-2\sqrt{\log(1/\varepsilon)}}^{2\sqrt{\log(1/\varepsilon)}} |\mathcal{N}(x) - T_{2K\log(1/\varepsilon)}(x)|\,dx < \frac{\varepsilon}{4} .$$

From Lemma 4 we know that such a $K$ always exists. For any $\varepsilon > 0$, let $\tilde{P}_\varepsilon(x)$ denote the piecewise polynomial
function defined as follows:

$$\tilde{P}_\varepsilon(x) = \begin{cases} T_{2K\log(1/\varepsilon)}(x) & \text{if } x \in [-2\sqrt{\log(1/\varepsilon)},\, 2\sqrt{\log(1/\varepsilon)}] \\ 0 & \text{otherwise.} \end{cases}$$

For any set of parameters $\theta \in \Theta_k$, let

$$P_{\varepsilon,\theta}(x) = \sum_{i=1}^k w_i \cdot \tau_i \cdot \tilde{P}_\varepsilon(\tau_i (x - \mu_i)) .$$

It is important to note that $P_{\varepsilon,\theta}(x)$ is a polynomial both as a function of $\theta$ and as a function of $x$. This
allows us to fit such shape-restricted polynomials with a system of polynomial inequalities. Moreover, our
shape-restricted polynomials are good approximations to GMMs. By construction, we get the following
result:

Lemma 6. Let $\theta \in \Theta_k$. Then $\|\mathcal{M}_\theta - P_{\varepsilon,\theta}\|_1 \le \varepsilon$.

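Proof. We have

$$\begin{aligned}
\|\mathcal{M}_\theta - P_{\varepsilon,\theta}\|_1 &= \int |\mathcal{M}_\theta(x) - P_{\varepsilon,\theta}(x)|\,dx \\
&\overset{(a)}{\le} \sum_{i=1}^k w_i \int \left| \tau_i \cdot \mathcal{N}(\tau_i(x - \mu_i)) - \tau_i \cdot \tilde{P}_\varepsilon(\tau_i(x - \mu_i)) \right| dx \\
&\overset{(b)}{\le} \sum_{i=1}^k w_i \cdot \|\mathcal{N} - \tilde{P}_\varepsilon\|_1 \\
&\overset{(c)}{\le} \sum_{i=1}^k w_i \cdot \varepsilon \\
&\le \varepsilon .
\end{aligned}$$

Here, (a) follows from the triangle inequality, (b) from a change of variables, and (c) from the definition of
$\tilde{P}_\varepsilon$. □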
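The three-piece construction can be checked numerically. The sketch below builds the truncated Taylor polynomial of the standard normal pdf and estimates its $L_1$ distance to the true pdf on a grid. The degree is passed in directly rather than derived from the constant $K$ of Definition 5, and the grid bounds and resolution are arbitrary choices; treat it as an illustration under those assumptions.

```python
import math

def taylor_gauss(x, degree):
    """Degree-`degree` Taylor polynomial of the standard normal pdf around 0."""
    total, term = 0.0, 1.0          # term = (-x^2/2)^n / n!, starting at n = 0
    y = -0.5 * x * x
    for n in range(degree // 2 + 1):
        total += term
        term *= y / (n + 1)
    return total / math.sqrt(2.0 * math.pi)

def shape_restricted(x, eps, degree):
    """Taylor piece on [-2 sqrt(log(1/eps)), 2 sqrt(log(1/eps))], zero on the tails."""
    radius = 2.0 * math.sqrt(math.log(1.0 / eps))
    return taylor_gauss(x, degree) if abs(x) <= radius else 0.0

def l1_error(eps, degree, lo=-12.0, hi=12.0, n=48000):
    """Midpoint-rule estimate of the L1 distance to the standard normal pdf."""
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        gauss = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
        total += abs(gauss - shape_restricted(x, eps, degree)) * dx
    return total

err_high_degree = l1_error(eps=0.1, degree=40)
err_low_degree = l1_error(eps=0.1, degree=2)
```

With a sufficiently high degree, the error is dominated by the Gaussian mass on the two flat tail pieces; with a very low degree, the center piece itself is a poor fit.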
2.2.4 $\mathcal{A}_K$-norm and intersections of $k$-GMMs

In our system of polynomial inequalities, we must encode the constraint that the shape-restricted polynomials
are a good fit to the density estimate. For this, the following notion of distance between two densities will
become useful.

Definition 7 ($\mathcal{A}_K$-norm). Let $\mathfrak{I}_K$ denote the family of all sets of $K$ disjoint intervals $\mathcal{I} = \{I_1, \ldots, I_K\}$. For
any measurable function $f : \mathbb{R} \to \mathbb{R}$, we define the $\mathcal{A}_K$-norm of $f$ to be

$$\|f\|_{\mathcal{A}_K} = \sup_{\mathcal{I} \in \mathfrak{I}_K} \sum_{I \in \mathcal{I}} \left| \int_I f(x)\,dx \right| .$$

For functions with few zero-crossings, the $\mathcal{A}_K$-norm is close to the $L_1$-norm. More formally, we have the
following properties, which are easy to check:

Lemma 8. Let $f : \mathbb{R} \to \mathbb{R}$ be a real function. Then for any $K \ge 1$, we have

$$\|f\|_{\mathcal{A}_K} \le \|f\|_1 .$$

Moreover, if $f$ is continuous and there are at most $K - 1$ distinct values $x$ for which $f(x) = 0$, then

$$\|f\|_{\mathcal{A}_K} = \|f\|_1 .$$
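For intuition, the $\mathcal{A}_K$-norm of a function sampled on a grid can be approximated with a small dynamic program: each of the at most $K$ intervals picks a sign, and the program maximizes the signed sum of cell integrals. This is only a numerical illustration (intervals are restricted to grid cell boundaries, and the function name is hypothetical); the paper's algorithm does not evaluate the norm this way.

```python
import math

def ak_norm(cell_integrals, K):
    """Max over at most K disjoint runs of cells of the sum of |run totals|."""
    NEG = float("-inf")
    closed = [0.0] + [NEG] * K      # closed[j]: best value with exactly j finished intervals
    open_pos = [NEG] * (K + 1)      # j-th interval still open, counted with sign +1
    open_neg = [NEG] * (K + 1)      # j-th interval still open, counted with sign -1
    for c in cell_integrals:
        new_closed = closed[:]
        for j in range(1, K + 1):
            # Either extend the open j-th interval or start it at this cell.
            open_pos[j] = max(open_pos[j], closed[j - 1]) + c
            open_neg[j] = max(open_neg[j], closed[j - 1]) - c
            new_closed[j] = max(closed[j], open_pos[j], open_neg[j])
        closed = new_closed
    return max(closed)

# Example: sin on [0, 2*pi] has one interior zero, so two intervals recover the L1-norm.
n = 4000
dx = 2.0 * math.pi / n
cells = [math.sin((i + 0.5) * dx) * dx for i in range(n)]
l1 = sum(abs(c) for c in cells)
a1, a2, a3 = ak_norm(cells, 1), ak_norm(cells, 2), ak_norm(cells, 3)
```

In this example a single interval captures only one lobe of the sine, while two intervals already match the $L_1$-norm, in line with Lemma 8.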

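The second property makes the $\mathcal{A}_K$-norm useful for us because linear combinations of Gaussians have
few zeros.

Fact 9 ([KMV10] Proposition 7). Let $f$ be a linear combination of $k$ Gaussian pdfs with variances $\sigma_1, \ldots, \sigma_k$
so that $\sigma_i \ne \sigma_j$ for all $i \ne j$. Then there are at most $2(k - 1)$ distinct values $x$ such that $f(x) = 0$.

These facts give the following corollary.

Corollary 10. Let $\theta_1, \theta_2 \in \Theta_k$ and let $K \ge 4k$. Then

$$\|\mathcal{M}_{\theta_1} - \mathcal{M}_{\theta_2}\|_1 = \|\mathcal{M}_{\theta_1} - \mathcal{M}_{\theta_2}\|_{\mathcal{A}_K} .$$

Proof. For any $\gamma > 0$, let $\theta_1^\gamma, \theta_2^\gamma$ be so that $\|\theta_i^\gamma - \theta_i\| \le \gamma$ for $i \in \{1, 2\}$, and so that the variances of
all the components in $\theta_1^\gamma, \theta_2^\gamma$ are all distinct. Lemma 8 and Fact 9 together imply that $\|\mathcal{M}_{\theta_1^\gamma} - \mathcal{M}_{\theta_2^\gamma}\|_1 =
\|\mathcal{M}_{\theta_1^\gamma} - \mathcal{M}_{\theta_2^\gamma}\|_{\mathcal{A}_K}$. Letting $\gamma \to 0$, the LHS tends to $\|\mathcal{M}_{\theta_1} - \mathcal{M}_{\theta_2}\|_1$, and the RHS tends to $\|\mathcal{M}_{\theta_1} - \mathcal{M}_{\theta_2}\|_{\mathcal{A}_K}$.
So we get that $\|\mathcal{M}_{\theta_1} - \mathcal{M}_{\theta_2}\|_{\mathcal{A}_K} = \|\mathcal{M}_{\theta_1} - \mathcal{M}_{\theta_2}\|_1$, as claimed. □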

3 Proper learning in the well-behaved case 

In this section, we focus on properly learning a mixture of $k$ Gaussians under the assumption that we have
a “well-behaved” density estimate. We study this case first in order to illustrate our use of shape-restricted
polynomials and the $\mathcal{A}_K$-norm. Intuitively, our notion of “well-behavedness” requires that there is a good
GMM fit to the density estimate such that the mixture components and the overall mixture distribution 
live at roughly the same scale. Algorithmically, this allows us to solve our system of polynomial inequalities 
with sufficient accuracy. In Section 4, we remove this assumption and show that our algorithm works for all 
univariate mixtures of Gaussians and requires no special assumptions on the density estimation algorithm. 

3.1 Overview of the Algorithm 

The first step of our algorithm is to learn a good piecewise-polynomial approximation $p_{\text{dens}}$ for the unknown
density $f$. We achieve this by invoking recent work on density estimation [ADLS15]. Once we have obtained
a good density estimate, it suffices to solve the following optimization problem:

$$\min_{\theta \in \Theta_k} \|p_{\text{dens}} - \mathcal{M}_\theta\|_1 .$$

Instead of directly fitting a mixture of Gaussians, we use a mixture of shape-restricted piecewise polynomials
as a proxy and solve

$$\min_{\theta \in \Theta_k} \|p_{\text{dens}} - P_{\varepsilon,\theta}\|_1 .$$

Now all parts of the optimization problem are piecewise polynomials. However, we will see that we cannot
directly work with the $L_1$-norm without increasing the size of the corresponding system of polynomial
inequalities substantially. Hence we work with the $\mathcal{A}_K$-norm instead and solve

$$\min_{\theta \in \Theta_k} \|p_{\text{dens}} - P_{\varepsilon,\theta}\|_{\mathcal{A}_K} .$$

We approach this problem by converting it to a system of polynomial inequalities with

1. $O(k)$ free variables: one per component weight, mean, and precision,

2. Two levels of quantification: one for the intervals of the $\mathcal{A}_K$-norm, and one for the breakpoints of the
shape-restricted polynomial. Each level quantifies over $O(k)$ variables.

3. A Boolean expression on polynomials with many constraints.

Finally, we use Renegar’s algorithm to approximately solve our system in time $(k \log(1/\varepsilon))^{O(k^4)}$. Because we
only have to consider the well-behaved case, we know that finding a polynomially good approximation to
the parameters will yield a sufficiently close approximation to the true underlying distribution.

3.2 Density estimation, rescaling, and well-behavedness 

Density estimation. As the first step of our algorithm, we obtain an agnostic estimate of the unknown
probability density $f$. For this, we run the density estimation subroutine ESTIMATE-DENSITY($k$, $\varepsilon$, $\delta$) from
Fact 2. Let $p'_{\text{dens}}$ be the resulting $O(k)$-piecewise polynomial. In the following, we condition on the event
that

$$\|f - p'_{\text{dens}}\|_1 \le 4 \cdot \mathrm{OPT}_k + \varepsilon ,$$

which occurs with probability $1 - \delta$.

Rescaling. Since we can solve systems of polynomial inequalities only with bounded precision, we have to
post-process the density estimate. For example, it could be the case that some mixture components have
extremely large mean parameters $\mu_i$, in which case accurately approximating these parameters could take
an arbitrary amount of time. Therefore, we shift and rescale $p'_{\text{dens}}$ so that its non-zero part is in $[-1, 1]$ (note
that $p'_{\text{dens}}$ can only have finite support because it consists of a bounded number of pieces).

Let $p_{\text{dens}}$ be the scaled and shifted piecewise polynomial. Since the $L_1$-norm is invariant under shifting
and scaling, it suffices to solve the following problem

$$\min_{\theta \in \Theta_k} \|p_{\text{dens}} - \mathcal{M}_\theta\|_1 .$$

Once we have solved this problem and found a corresponding $\theta$ with

$$\|p_{\text{dens}} - \mathcal{M}_\theta\|_1 \le C$$

for some $C \ge 0$, we can undo the transformation applied to the density estimate and get a $\theta' \in \Theta_k$ such that

$$\|p'_{\text{dens}} - \mathcal{M}_{\theta'}\|_1 \le C .$$
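To make the shift-and-rescale step concrete, the sketch below maps GMM parameters through the affine map that sends an interval $[\alpha, \beta]$ to $[-1, 1]$, and back. The interval endpoints `alpha` and `beta` and the function names are illustrative assumptions; the only facts used are that a mean moves with the affine map and that a precision (one over the standard deviation) scales inversely with the slope.

```python
def to_unit_interval(w, mu, tau, alpha, beta):
    """Map GMM parameters under x -> 2 (x - alpha) / (beta - alpha) - 1."""
    s = 2.0 / (beta - alpha)                      # slope of the affine map
    return (list(w),
            [s * (m - alpha) - 1.0 for m in mu],  # means move with the map
            [t / s for t in tau])                 # std scales by s, precision by 1/s

def from_unit_interval(w, mu, tau, alpha, beta):
    """Inverse of to_unit_interval: undo the shift and rescaling."""
    s = 2.0 / (beta - alpha)
    return (list(w),
            [(m + 1.0) / s + alpha for m in mu],
            [t * s for t in tau])

# Hypothetical parameters for a density estimate supported on [2, 6].
theta = ([0.2, 0.3, 0.5], [2.0, 6.0, 4.0], [1.0, 2.0, 4.0])
scaled = to_unit_interval(*theta, alpha=2.0, beta=6.0)
restored = from_unit_interval(*scaled, alpha=2.0, beta=6.0)
```

The mixing weights are unchanged by the transformation, which is why the $L_1$ error carries over between the two scales.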
Well-behavedness. While rescaling the density estimate $p'_{\text{dens}}$ to the interval $[-1, 1]$ controls the size of
the mean parameters $\mu_i$, the precision parameters $\tau_i$ can still be arbitrarily large. Note that for a mixture
component with very large precision, we also have to approximate the corresponding $\mu_i$ very accurately. For
clarity of presentation, we ignore this issue in this section and assume that the density estimate is
well-behaved. This assumption allows us to control the accuracy in Renegar’s algorithm appropriately. We revisit
this point in Section 4 and show how to overcome this limitation. Formally, we introduce the following
assumption:

Definition 11 (Well-behaved density estimate). Let $p'_{\text{dens}}$ be a density estimate and let $p_{\text{dens}}$ be the rescaled
version that is supported on the interval $[-1, 1]$ only. Then we say $p_{\text{dens}}$ is $\gamma$-well-behaved if there is a set of
GMM parameters $\theta \in \Theta_k$ such that

$$\|p_{\text{dens}} - \mathcal{M}_\theta\|_1 = \min_{\theta^* \in \Theta_k} \|p_{\text{dens}} - \mathcal{M}_{\theta^*}\|_1$$

and $\tau_i \le \gamma$ for all $i \in [k]$.

The well-behaved case is interesting in its own right because components with very high precision
parameter, i.e., very spiky Gaussians, can often be learnt by clustering the samples.² Moreover, the well-behaved
case illustrates our use of shape-restricted polynomials and the $\mathcal{A}_K$-distance without additional technical
difficulties.


3.3 The $\mathcal{A}_K$-norm as a proxy for the $L_1$-norm

Computing the $L_1$-distance between the density estimate $p_{\text{dens}}$ and our shape-restricted polynomial
approximation $P_{\varepsilon,\theta}$ exactly requires knowledge of the zeros of the piecewise polynomial $p_{\text{dens}} - P_{\varepsilon,\theta}$. In a system of
polynomial inequalities, these zeros can be encoded by introducing auxiliary variables. However, note that
we cannot simply introduce one variable per zero-crossing without affecting the running time significantly:
since the polynomials have degree $O(\log 1/\varepsilon)$, this would lead to $O(k \log 1/\varepsilon)$ variables, and hence the running
time of Renegar’s algorithm would depend exponentially on $O(\log 1/\varepsilon)$. Such an exponential dependence
on $\log(1/\varepsilon)$ means that the running time of solving the system of polynomial inequalities becomes
super-polynomial in $1/\varepsilon$, while our goal was to avoid any polynomial dependence on $1/\varepsilon$ when solving the system of
polynomial inequalities.

Instead, we use the $\mathcal{A}_K$-norm as an approximation of the $L_1$-norm. Since both $P_{\varepsilon,\theta}$ and $p_{\text{dens}}$ are close
to mixtures of $k$ Gaussians, their difference only has $O(k)$ zero crossings that contribute significantly to the
$L_1$-norm. More formally, we should have $\|p_{\text{dens}} - P_{\varepsilon,\theta}\|_1 \approx \|p_{\text{dens}} - P_{\varepsilon,\theta}\|_{\mathcal{A}_K}$. And indeed:

Lemma 12. Let $\varepsilon > 0$, $k \ge 2$, $\theta \in \Theta_k$, and $K = 4k$. Then we have

$$0 \le \|p_{\text{dens}} - P_{\varepsilon,\theta}\|_1 - \|p_{\text{dens}} - P_{\varepsilon,\theta}\|_{\mathcal{A}_K} \le 8 \cdot \mathrm{OPT}_k + O(\varepsilon) .$$


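Proof. Recall Lemma 8: for any function $f$, we have $\|f\|_{\mathcal{A}_K} \le \|f\|_1$. Thus, we know that $\|p_{\text{dens}} - P_{\varepsilon,\theta}\|_{\mathcal{A}_K} \le
\|p_{\text{dens}} - P_{\varepsilon,\theta}\|_1$. Hence, it suffices to show that $\|p_{\text{dens}} - P_{\varepsilon,\theta}\|_1 \le 8 \cdot \mathrm{OPT}_k + O(\varepsilon) + \|p_{\text{dens}} - P_{\varepsilon,\theta}\|_{\mathcal{A}_K}$.

We have conditioned on the event that the density estimation algorithm succeeds. So from Fact 2, we
know that there is some mixture of $k$ Gaussians $\mathcal{M}_{\theta'}$ so that $\|p_{\text{dens}} - \mathcal{M}_{\theta'}\|_1 \le 4 \cdot \mathrm{OPT}_k + \varepsilon$. By repeated
applications of the triangle inequality and Corollary 10, we get

$$\begin{aligned}
\|p_{\text{dens}} - P_{\varepsilon,\theta}\|_1 &\le \|p_{\text{dens}} - \mathcal{M}_{\theta'}\|_1 + \|\mathcal{M}_{\theta'} - \mathcal{M}_\theta\|_1 + \|P_{\varepsilon,\theta} - \mathcal{M}_\theta\|_1 \\
&\le 4 \cdot \mathrm{OPT}_k + \varepsilon + \|\mathcal{M}_{\theta'} - \mathcal{M}_\theta\|_{\mathcal{A}_K} + \varepsilon \\
&\le 4 \cdot \mathrm{OPT}_k + 2\varepsilon + \|\mathcal{M}_{\theta'} - p_{\text{dens}}\|_{\mathcal{A}_K} + \|p_{\text{dens}} - P_{\varepsilon,\theta}\|_{\mathcal{A}_K} + \|P_{\varepsilon,\theta} - \mathcal{M}_\theta\|_{\mathcal{A}_K} \\
&\le 4 \cdot \mathrm{OPT}_k + 2\varepsilon + \|\mathcal{M}_{\theta'} - p_{\text{dens}}\|_1 + \|p_{\text{dens}} - P_{\varepsilon,\theta}\|_{\mathcal{A}_K} + \|P_{\varepsilon,\theta} - \mathcal{M}_\theta\|_1 \\
&\le 8 \cdot \mathrm{OPT}_k + 4\varepsilon + \|p_{\text{dens}} - P_{\varepsilon,\theta}\|_{\mathcal{A}_K} ,
\end{aligned}$$

as claimed. □

² However, very spiky Gaussians can still be very close, which makes this approach challenging in some cases; see Section 4 for details.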


Using this connection between the $\mathcal{A}_K$-norm and the $L_1$-norm, we can focus our attention on the following
problem:

$$\min_{\theta \in \Theta_k} \|p_{\text{dens}} - P_{\varepsilon,\theta}\|_{\mathcal{A}_K} .$$

As mentioned above, this problem is simpler from a computational perspective because we only have to
introduce $O(k)$ variables into the system of polynomial inequalities, regardless of the value of $\varepsilon$.

When encoding the above minimization problem in a system of polynomial inequalities, we convert it to
a sequence of feasibility problems. In particular, we solve $O(\log(1/\varepsilon))$ feasibility problems of the form

$$\text{Find } \theta \in \Theta_k \text{ s.t. } \|p_{\text{dens}} - P_{\varepsilon,\theta}\|_{\mathcal{A}_K} \le \nu . \qquad (1)$$
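The reduction from minimization to a sequence of feasibility problems can be sketched as a doubling search over the threshold $\nu$. In the sketch below, `is_feasible` is a hypothetical stand-in for a call to a polynomial-system solver, and the cap `nu_max` is an assumption added only to guarantee termination (the $L_1$ distance between two densities never exceeds 2).

```python
def smallest_feasible_nu(is_feasible, eps, nu_max=4.0):
    """Double nu, starting from eps, until the feasibility oracle accepts.

    Returns the accepted threshold and the number of oracle calls.
    `is_feasible` is a placeholder for solving one feasibility problem.
    """
    nu, calls = eps, 0
    while True:
        calls += 1
        if is_feasible(nu):
            return nu, calls
        if nu >= nu_max:
            raise ValueError("no feasible threshold up to nu_max")
        nu *= 2.0

# Toy oracle: pretend the system becomes feasible once nu reaches 0.37.
nu, calls = smallest_feasible_nu(lambda v: v >= 0.37, eps=0.01)
```

Because the threshold doubles each round, the accepted value is within a factor of two of the smallest feasible threshold, and the number of oracle calls is logarithmic in `nu_max / eps`.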

Next, we show how to encode such an $\mathcal{A}_K$-constraint in a system of polynomial inequalities.

3.4 A general system of polynomial inequalities for encoding closeness in $\mathcal{A}_K$-norm

In this section, we give a general construction for the $\mathcal{A}_K$-distance between any fixed piecewise polynomial
(in particular, the density estimate) and any piecewise polynomial we optimize over (in particular, our
shape-restricted polynomials which we wish to fit to the density estimate). The only restriction we require
is that we already have variables for the breakpoints of the polynomial we optimize over. As long as these
breakpoints depend only polynomially or rationally on the parameters of the shape-restricted polynomial,
this is easy to achieve. Presenting our construction of the $\mathcal{A}_K$-constraints in this generality makes it easy
to adapt our techniques to the general algorithm (without the well-behavedness assumption, see Section 4)
and to new classes of distributions (see Section 5).

The setup in this section will be as follows. Let $p$ be a given, fixed piecewise polynomial supported on
$[-1, 1]$ with breakpoints $c_1, \ldots, c_r$. Let $\mathcal{P}$ be a set of piecewise polynomials so that for all $\theta \in S \subseteq \mathbb{R}^u$ for
some fixed, known $S$, there is a $P_\theta(x) \in \mathcal{P}$ with breakpoints $d_1(\theta), \ldots, d_s(\theta)$ such that

• $S$ is a semi-algebraic set.³ Moreover, assume membership in $S$ can be stated as a Boolean formula over
$R$ polynomial predicates, each of degree at most $D_1$, for some $R$, $D_1$.

• For all $1 \le i \le s$, there is a polynomial $h_i$ so that $h_i(d_i(\theta), \theta) = 0$, and moreover, for all $\theta$, we have
that $d_i(\theta)$ is the unique real number $y$ satisfying $h_i(y, \theta) = 0$. That is, the breakpoints of $P_\theta$ can be
encoded as polynomial equalities in the $\theta$’s. Let $D_2$ be the maximum degree of any $h_i$.

• The function $(x, \theta) \mapsto P_\theta(x)$ is a polynomial in $x$ and $\theta$ as long as $x$ is not at a breakpoint of $P_\theta$. Let
$D_3$ be the maximum degree of this polynomial.

Let $D = \max(D_1, D_2, D_3)$.

Our goal then is to encode the following problem as a system of polynomial inequalities:

$$\text{Find } \theta \in S \text{ s.t. } \|p - P_\theta\|_{\mathcal{A}_K} \le \nu . \qquad (2)$$

In Section 3.5, we show that this is indeed a generalization of the problem in Equation (1), for suitable
choices of $S$ and $\mathcal{P}$.

In the following, let $p^{\text{diff}}_\theta := p - P_\theta$. Note that $p^{\text{diff}}_\theta$ is a piecewise polynomial with breakpoints contained
in $\{c_1, \ldots, c_r, d_1(\theta), \ldots, d_s(\theta)\}$. In order to encode the $\mathcal{A}_K$-constraint, we use the fact that a system of
polynomial inequalities can contain for-all quantifiers. Hence it suffices to encode the $\mathcal{A}_K$-constraint for a
single set of $K$ intervals. We provide a construction for a single $\mathcal{A}_K$-constraint in Section 3.4.1. In Section
3.4.2, we introduce two further constraints that guarantee validity of the parameters $\theta$ and combine these
constraints with the $\mathcal{A}_K$-constraint to produce the full system of polynomial inequalities.

³ Recall a semi-algebraic set is a set where membership in the set can be described by polynomial inequalities.
3.4.1 Encoding Closeness for a Fixed Set of Intervals

Let $[a_1, b_1], \ldots, [a_K, b_K]$ be $K$ disjoint intervals. In this section we show how to encode the following
constraint:

$$\sum_{i=1}^K \left| \int_{a_i}^{b_i} p^{\text{diff}}_\theta(x)\,dx \right| \le \nu .$$

Note that a given interval $[a_i, b_i]$ might contain several pieces of $p^{\text{diff}}_\theta$. In order to encode the integral over
$[a_i, b_i]$ correctly, we must therefore know the current order of the breakpoints (which can depend on $\theta$).

However, once the order of the breakpoints of $p^{\text{diff}}_\theta$ and the $a_i$ and $b_i$ is fixed, the integral over $[a_i, b_i]$
becomes the integral over a fixed set of sub-intervals. Since the integral over a single polynomial piece is still
a polynomial, we can then encode this integral over $[a_i, b_i]$ piece-by-piece.

More formally, let $\Phi$ be the set of permutations of the variables

$$\{a_1, \ldots, a_K, b_1, \ldots, b_K, c_1, \ldots, c_r, d_1(\theta), \ldots, d_s(\theta)\}$$

such that (i) the $a_i$ appear in order, (ii) the $b_i$ appear in order, (iii) $a_i$ appears before $b_i$, and (iv) the $c_i$
appear in order. Let $t = 2K + r + s$. For any $\phi = (\phi_1, \ldots, \phi_t) \in \Phi$, let

$$\text{ordered}_{p,\mathcal{P}}(\phi) := \bigwedge_{i=1}^{t-1} (\phi_i \le \phi_{i+1}) .$$

Note that for any fixed $\phi$, this is an unquantified Boolean formula with polynomial constraints in the unknown
variables. The order constraints encode whether the current set of variables corresponds to ordered variables
under the permutation represented by $\phi$. An important property of an ordered $\phi$ is the following: in each
interval $[\phi_i, \phi_{i+1}]$, the piecewise polynomial $p^{\text{diff}}_\theta$ has exactly one piece. This allows us to integrate over $p^{\text{diff}}_\theta$
in our system of polynomial inequalities.
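The role of the ordering can be illustrated numerically for concrete (non-symbolic) breakpoints: after sorting all interval endpoints and breakpoints, each consecutive pair of points lies inside a single polynomial piece, so the integral is a plain polynomial integral. The sketch below computes the sum of absolute interval integrals of $p - q$ this way. The representation of a piecewise polynomial as `(breaks, polys)` with low-to-high coefficient lists is an assumption made for the example, and the code evaluates numbers rather than building the symbolic encoding used in the paper.

```python
def integrate_poly(coeffs, lo, hi):
    """Integral over [lo, hi] of sum_n coeffs[n] * x**n."""
    anti = lambda x: sum(c * x ** (n + 1) / (n + 1) for n, c in enumerate(coeffs))
    return anti(hi) - anti(lo)

def piece_integral(breaks, polys, lo, hi):
    """Integral over [lo, hi], assuming [lo, hi] lies inside a single piece
    (or entirely outside the support, where the function is zero)."""
    mid = 0.5 * (lo + hi)
    for i, coeffs in enumerate(polys):
        if breaks[i] <= mid <= breaks[i + 1]:
            return integrate_poly(coeffs, lo, hi)
    return 0.0

def interval_sum(p, q, intervals):
    """Sum over disjoint intervals [a_j, b_j] of |integral of (p - q)|."""
    (p_breaks, p_polys), (q_breaks, q_polys) = p, q
    points = sorted(set(p_breaks) | set(q_breaks)
                    | {e for ab in intervals for e in ab})
    totals = [0.0] * len(intervals)
    for lo, hi in zip(points, points[1:]):
        for j, (a, b) in enumerate(intervals):
            if a <= lo and hi <= b:   # sub-interval [lo, hi] is active for interval j
                totals[j] += (piece_integral(p_breaks, p_polys, lo, hi)
                              - piece_integral(q_breaks, q_polys, lo, hi))
    return sum(abs(v) for v in totals)

# p(x) = 1 on [-1, 1]; q(x) = x on [0, 1]; two disjoint intervals.
p = ([-1.0, 1.0], [[1.0]])
q = ([0.0, 1.0], [[0.0, 1.0]])
value = interval_sum(p, q, [(-1.0, 0.5), (0.5, 1.0)])
```

Here the sorted points are $-1, 0, 0.5, 1$, and the three sub-intervals contribute $1$, $0.375$, and $0.125$ respectively.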

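Next, we need to encode whether a fixed interval between $\phi_i$ and $\phi_{i+1}$ is contained in one of the
$\mathcal{A}_K$-intervals, i.e., whether we have to integrate $p^{\text{diff}}_\theta$ over the interval $[\phi_i, \phi_{i+1}]$ when we compute the $\mathcal{A}_K$-norm
of $p^{\text{diff}}_\theta$. We use the following expression:

$$\text{is-active}_{p,\mathcal{P}}(\phi, i) := \begin{cases} 1 & \text{if there is a } j \text{ such that } a_j \text{ appears as or before } \phi_i \text{ in } \phi \text{ and } b_j \text{ appears as or after } \phi_{i+1} \\ 0 & \text{otherwise.} \end{cases}$$

Note that for fixed $\phi$ and $i$, this expression is either 0 or 1 (and hence trivially a polynomial).

With the constructs introduced above, we can now integrate $p^{\text{diff}}_\theta$ over an interval $[\phi_i, \phi_{i+1}]$. It remains
to bound the absolute value of the integral for each individual piece. For this, we introduce a set of $t$ new
variables $\xi_1, \ldots, \xi_t$ which will correspond to the absolute value of the integral in the corresponding piece.

$$\mathcal{A}_K\text{-bounded-interval}_{p,\mathcal{P}}(\phi, \theta, \xi, i) := \left( \left( -\xi_i \le \int_{\phi_i}^{\phi_{i+1}} p^{\text{diff}}_\theta(x)\,dx \right) \wedge \left( \int_{\phi_i}^{\phi_{i+1}} p^{\text{diff}}_\theta(x)\,dx \le \xi_i \right) \right) \vee \left( \text{is-active}_{p,\mathcal{P}}(\phi, i) = 0 \right) .$$

Note that the above is a valid polynomial constraint because $p^{\text{diff}}_\theta$ depends only on $\theta$ and $x$ for fixed breakpoint
order $\phi$ and fixed interval $[\phi_i, \phi_{i+1}]$. Moreover, recall that by assumption, $P_\theta(x)$ depends polynomially on
both $\theta$ and $x$, and therefore the same holds for $p^{\text{diff}}_\theta$.

We extend the $\mathcal{A}_K$-check for a single interval to the entire range of $p^{\text{diff}}_\theta$ as follows:

$$\mathcal{A}_K\text{-bounded-fixed-permutation}_{p,\mathcal{P}}(\phi, \theta, \xi) := \bigwedge_{i=1}^{t-1} \mathcal{A}_K\text{-bounded-interval}_{p,\mathcal{P}}(\phi, \theta, \xi, i) .$$

We now have all the tools to encode the $\mathcal{A}_K$-constraint for a fixed set of intervals:

$$\mathcal{A}_K\text{-bounded}_{p,\mathcal{P}}(\theta, \nu, a, b, c, d, \xi) := \left( \sum_i \xi_i \le \nu \right) \wedge \left( \bigwedge_i (\xi_i \ge 0) \right) \wedge \left( \bigvee_{\phi \in \Phi} \text{ordered}_{p,\mathcal{P}}(\phi) \wedge \mathcal{A}_K\text{-bounded-fixed-permutation}_{p,\mathcal{P}}(\phi, \theta, \xi) \right) .$$

By construction, the above constraint now satisfies the following:

Lemma 13. There exists a vector $\xi \in \mathbb{R}^t$ such that $\mathcal{A}_K\text{-bounded}_{p,\mathcal{P}}(\theta, \nu, a, b, c, d, \xi)$ is true if and only if

$$\sum_{i=1}^K \left| \int_{a_i}^{b_i} p^{\text{diff}}_\theta(x)\,dx \right| \le \nu .$$

Moreover, $\mathcal{A}_K\text{-bounded}_{p,\mathcal{P}}$ has less than $6t^{t+1}$ polynomial constraints.

The bound on the number of polynomial constraints follows simply from counting the number of
polynomial constraints in the construction described above.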

3.4.2 Complete system of polynomial inequalities 

In addition to the $\mathcal{A}_K$-constraint introduced in the previous subsection, our system of polynomial inequalities
contains the following constraints:

Valid parameters. First, we encode that the mixture parameters we optimize over are valid, i.e., we let

$$\text{valid-parameters}_S(\theta) := \theta \in S .$$

Recall this can be expressed as a Boolean formula over $R$ polynomial predicates of degree at most $D$.

Correct breakpoints. We require that the $d_i$ are indeed the breakpoints of the shape-restricted polynomial
$P_\theta$. By the assumption, this can be encoded by the following constraint:

$$\text{correct-breakpoints}_{\mathcal{P}}(\theta, d) := \bigwedge_{i=1}^{s} \left( h_i(d_i(\theta), \theta) = 0 \right) .$$

The full system of polynomial inequalities. We now combine the constraints introduced above and
introduce our entire system of polynomial inequalities:

$$S_{K,p,\mathcal{P},S}(\nu) = \forall a_1, \ldots, a_K, b_1, \ldots, b_K : \; \exists d_1, \ldots, d_s, \xi_1, \ldots, \xi_t : \; \text{valid-parameters}_S(\theta) \wedge \text{correct-breakpoints}_{\mathcal{P}}(\theta, d) \wedge \mathcal{A}_K\text{-bounded}_{p,\mathcal{P}}(\theta, \nu, a, b, c, d, \xi) .$$

This system of polynomial inequalities has

• two levels of quantification, with $2K$ and $s + t$ variables, respectively,

• $u$ free variables,

• fewer than $R + s + 6t^{t+1}$ polynomial constraints (by Lemma 13),

• and maximum degree $D$ in the polynomial constraints.

Let $\gamma$ be a bound on the free variables, i.e., $\|\theta\|_2 \le \gamma$, and let $\lambda$ be a precision parameter. Then Renegar’s
algorithm (see Fact 3) finds a $\lambda$-approximate solution $\theta$ for this system of polynomial inequalities satisfying
$\|\theta\|_2 \le \gamma$, if one exists, in time

$$\left( (R + s + 6t^{t+1}) D \right)^{O(K (s + t) u)} \log\log\left(3 + \frac{\gamma}{\lambda}\right) .$$
3.5 Instantiating the system of polynomial inequalities for GMMs

We now show how to use the system of polynomial inequalities developed in the previous subsection for our
initial goal: that is, encoding closeness between a well-behaved density estimate and a set of shape-restricted
polynomials (see Equation 1). Our fixed piecewise polynomial ($p$ in the subsection above) will be $p_{\text{dens}}$. The
set of piecewise polynomials we optimize over (the set $\mathcal{P}$ in the previous subsection) will be the set $\mathcal{P}_\varepsilon$ of all
shape-restricted polynomials $P_{\varepsilon,\theta}$. Our $S$ (the domain of $\theta$) will be $\Theta_{k,\gamma} \subseteq \Theta_k$, which we define below. For
each $\theta \in S$, we associate it with $P_{\varepsilon,\theta}$. Moreover:

• Define

$$\Theta_{k,\gamma} = \left\{ \theta \;\middle|\; \left( \sum_{i=1}^k w_i = 1 \right) \wedge \left( \forall i \in [k] : (w_i \ge 0) \wedge (\gamma \ge \tau_i > 0) \wedge (-1 \le \mu_i \le 1) \right) \right\} ,$$

that is, the set of parameters which have bounded means and variances. $S$ is indeed semi-algebraic,
and membership in $S$ can be encoded using $2k + 1$ polynomial predicates, each with degree $D_1 = 1$.

• For any fixed parameter $\theta \in \Theta_k$, the shape-restricted polynomial $P_{\varepsilon,\theta}$ has $s = 2k$ breakpoints by
definition, and the breakpoints $d_1(\theta), \ldots, d_{2k}(\theta)$ of $P_{\varepsilon,\theta}$ occur at

$$d_{2i-1}(\theta) = \frac{1}{\tau_i}\left(\mu_i - 2\tau_i \log(1/\varepsilon)\right) , \quad d_{2i}(\theta) = \frac{1}{\tau_i}\left(\mu_i + 2\tau_i \log(1/\varepsilon)\right) , \quad \text{for all } 1 \le i \le k .$$

Thus, for all parameters $\theta$, the breakpoints $d_1(\theta), \ldots, d_{2k}(\theta)$ are the unique numbers so that

$$\tau_i \cdot d_{2i-1}(\theta) - \left(\mu_i - 2\tau_i \log(1/\varepsilon)\right) = 0 , \quad \tau_i \cdot d_{2i}(\theta) - \left(\mu_i + 2\tau_i \log(1/\varepsilon)\right) = 0 , \quad \text{for all } 1 \le i \le k ,$$

and thus each of the $d_1(\theta), \ldots, d_{2k}(\theta)$ can be encoded as a polynomial equality of degree $D_2 = 2$.

• Finally, it is straightforward to verify that the map $(x, \theta) \mapsto P_{\varepsilon,\theta}(x)$ is a polynomial of degree $D_3 =
O(\log 1/\varepsilon)$ in $(x, \theta)$, at any point where $x$ is not at a breakpoint of $P_{\varepsilon,\theta}$.

From the previous subsection, we know that the system of polynomial inequalities $S_{K,p_{\text{dens}},\mathcal{P}_\varepsilon,\Theta_{k,\gamma}}(\nu)$ has
two levels of quantification, each with $O(k)$ variables, it has $k^{O(k)}$ polynomial constraints, and has maximum
degree $O(\log 1/\varepsilon)$ in the polynomial constraints. Hence, we have shown:

Corollary 14. For any fixed $\varepsilon$, the system of polynomial inequalities $S_{K,p_{\text{dens}},\mathcal{P}_\varepsilon,\Theta_{k,\gamma}}(\nu)$ encodes Equation
(1). Moreover, for all $\gamma, \lambda > 0$, Renegar’s algorithm SOLVE-POLY-SYSTEM($S_{K,p_{\text{dens}},\mathcal{P}_\varepsilon,\Theta_{k,\gamma}}(\nu)$, $\lambda$, $\gamma$) runs in
time $(k \log(1/\varepsilon))^{O(k^4)} \log\log(3 + \frac{\gamma}{\lambda})$.


3.6 Overall learning algorithm 

We now combine our tools developed so far and give an agnostic learning algorithm for the case of well- 
behaved density estimates (see Algorithm 1). 


3.7 Analysis 

Before we prove correctness of Learn-Well-Behaved-GMM, we introduce two auxiliary lemmas. 

An important consequence of the well-behavedness assumption (see Definition 11) are the following 
robustness properties. 

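Lemma 15 (Parameter stability). Fix $2 \ge \varepsilon > 0$. Let the parameters $\theta, \theta' \in \Theta_k$ be such that (i) $\tau_i, \tau'_i \le \gamma$
for all $i \in [k]$ and (ii) $\|\theta - \theta'\|_2 \le C \frac{1}{\gamma} \left(\frac{\varepsilon}{k}\right)^2$, for some universal constant $C$. Then

$$\|\mathcal{M}_\theta - \mathcal{M}_{\theta'}\|_1 \le \varepsilon .$$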
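Algorithm 1 Algorithm for learning a mixture of Gaussians in the well-behaved case.

 1: function LEARN-WELL-BEHAVED-GMM($k$, $\varepsilon$, $\delta$, $\gamma$)
 2:     ▷ Density estimation. Only this step draws samples.
 3:     $p'_{\text{dens}} \leftarrow$ ESTIMATE-DENSITY($k$, $\varepsilon$, $\delta$)
 4:     ▷ Rescaling
 5:     Let $p_{\text{dens}}$ be a rescaled and shifted version of $p'_{\text{dens}}$ such that the support of $p_{\text{dens}}$ is $[-1, 1]$.
 6:     Let $\alpha$ and $\beta$ be such that $p_{\text{dens}}(x) = p'_{\text{dens}}\left(\frac{2(x - \alpha)}{\beta - \alpha} - 1\right)$
 7:     ▷ Fitting shape-restricted polynomials
 8:     $K \leftarrow 4k$
 9:     $\nu \leftarrow \varepsilon$
10:     $\theta \leftarrow$ SOLVE-POLY-SYSTEM($S_{K,p_{\text{dens}},\mathcal{P}_\varepsilon,\Theta_{k,\gamma}}(\nu)$, $C \frac{1}{\gamma} \left(\frac{\varepsilon}{k}\right)^2$, $3k\gamma$)
11:     while $\theta$ is “NO-SOLUTION” do
12:         $\nu \leftarrow 2 \cdot \nu$
13:         $\theta \leftarrow$ SOLVE-POLY-SYSTEM($S_{K,p_{\text{dens}},\mathcal{P}_\varepsilon,\Theta_{k,\gamma}}(\nu)$, $C \frac{1}{\gamma} \left(\frac{\varepsilon}{k}\right)^2$, $3k\gamma$)
14:     ▷ Fix the parameters
15:     for $i = 1, \ldots, k$ do
16:         if $\tau_i \le 0$, set $w_i \leftarrow 0$ and set $\tau_i$ to be arbitrary but positive.
17:     Let $W = \sum_{i=1}^k w_i$
18:     for $i = 1, \ldots, k$ do
19:         $w_i \leftarrow w_i / W$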
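The parameter fix-up at the end of the listing (zeroing the weight of any component whose precision is not positive, then renormalizing the weights) can be sketched as follows. The function name and the default replacement precision `positive_tau` are illustrative assumptions; the listing only requires the replacement value to be positive.

```python
def fix_parameters(w, mu, tau, positive_tau=1.0):
    """Drop components with non-positive precision and renormalize the weights."""
    w, tau = list(w), list(tau)
    for i in range(len(w)):
        if tau[i] <= 0:
            w[i] = 0.0               # the component no longer contributes
            tau[i] = positive_tau    # any positive value gives a valid Gaussian
    total = sum(w)
    if total <= 0:
        raise ValueError("all mixing weights vanished")
    return [wi / total for wi in w], list(mu), tau

# Hypothetical solver output with one invalid precision.
w_fixed, mu_fixed, tau_fixed = fix_parameters(
    [0.5, 0.3, 0.2], [0.1, -0.4, 0.7], [2.0, -0.1, 4.0])
```

After this step the weights form a valid probability vector and every precision is positive, so the returned parameters describe a proper GMM.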



Before we prove this lemma, we first need a calculation which quantifies the robustness of the standard 
normal pdf to small perturbations. 

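Lemma 16. For all $2 \ge \varepsilon > 0$, there is a $\delta_1 = \delta_1(\varepsilon) = \frac{\varepsilon}{20\sqrt{\log(1/\varepsilon)}} \ge \Omega(\varepsilon^2)$, so that for all $\delta \le \delta_1$, we have

$$\|\mathcal{N}(x) - \mathcal{N}(x + \delta)\|_1 \le O(\varepsilon) .$$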
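As a quick numerical sanity check of this kind of shift bound, the $L_1$ distance between the standard normal pdf and a copy shifted by $\delta$ has the closed form $2\,\mathrm{erf}(\delta / (2\sqrt{2}))$ (twice the total variation distance between two unit-variance Gaussians whose means differ by $\delta$). The sketch below evaluates it at the shift $\varepsilon / (20\sqrt{\log(1/\varepsilon)})$; that specific choice of shift is an assumption taken for illustration, and the helper names are hypothetical.

```python
import math

def shifted_normal_l1(delta):
    """Exact L1 distance between the standard normal pdf and its shift by delta."""
    return 2.0 * math.erf(abs(delta) / (2.0 * math.sqrt(2.0)))

def delta_1(eps):
    """Illustrative shift eps / (20 sqrt(log(1/eps))); requires 0 < eps < 1."""
    return eps / (20.0 * math.sqrt(math.log(1.0 / eps)))

# Distance at the illustrative shift, for several accuracy levels.
distances = {eps: shifted_normal_l1(delta_1(eps)) for eps in (0.5, 0.1, 0.01, 1e-4)}
```

For small shifts the distance grows roughly linearly in $\delta$ (about $\delta \sqrt{2/\pi}$), so a shift of this size changes the pdf by far less than $\varepsilon$ in $L_1$.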

Proof. Note that if $\varepsilon \ge 2$ this claim holds trivially for all choices of $\delta$ since the $L_1$-distance between two
distributions can only ever be 2. Thus assume that $\varepsilon < 2$. Let $I$ be an interval centered at 0 so that both
$\mathcal{N}(x)$ and $\mathcal{N}(x + \delta)$ assign $1 - \frac{\varepsilon}{2}$ weight on this interval. By standard properties of Gaussians, we know that
$|I| \le 10\sqrt{\log(1/\varepsilon)}$. We thus have
Proof of Lemma 15. Notice the $\ell_2$ guarantee of Renegar’s algorithm (see Fact 3) also trivially implies an
$\ell_\infty$ guarantee on the error in the parameters $\theta$; that is, for all $i$, we will have that the weights, means,
and variances of the two components differ by at most $C \frac{1}{\gamma} \left(\frac{\varepsilon}{k}\right)^2$. By repeated applications of the triangle
inequality to the quantity in the lemma, it suffices to show the three following claims:

• For any /i, t, 

\\wiJ\f^^r{x) - W2Af^,r{x)\\l < ^ 

ifK-u; 2 |<Ci(f)^ 

• For any r < 7 , 

if Imi -Ai 2 | < C'i (f)^ 

• For any /i, 

\Wl^,rAx) - J\ff,,r2{x)\\l < ^ 

if|Ti-T 2 |<Ci(f)^ 

The first inequality is trivial, for C sufficiently small. The second and third inequalities follow from a change 
of variables and an application of Lemma 16. □ 

Recall that our system of polynomial inequalities only considers mean parameters in [—1,1]. The following 
lemma shows that this restriction still allows us to find a good approximation once the density estimate is 
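rescaled to $[-1, 1]$.

Lemma 17 (Restricted means). Let $g : \mathbb{R} \to \mathbb{R}$ be a function supported on $[-1, 1]$, i.e., $g(x) = 0$ for
$x \notin [-1, 1]$. Moreover, let $\theta^* \in \Theta_k$. Then there is a $\theta' \in \Theta_k$ such that $\mu'_i \in [-1, 1]$ for all $i \in [k]$ and

$$\|g - \mathcal{M}_{\theta'}\|_1 \le 5 \cdot \|g - \mathcal{M}_{\theta^*}\|_1 .$$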

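Proof. Let $A = \{i \mid \mu^*_i \in [-1, 1]\}$ and $B = [k] \setminus A$. Let $\theta'$ be defined as follows:

• $w'_i = w^*_i$ for all $i \in [k]$.

• $\mu'_i = \mu^*_i$ for $i \in A$ and $\mu'_i = 0$ for $i \in B$.

• $\tau'_i = \tau^*_i$ for $i \in [k]$.

From the triangle inequality, we have

$$\|g - \mathcal{M}_{\theta'}\|_1 \le \|g - \mathcal{M}_{\theta^*}\|_1 + \|\mathcal{M}_{\theta^*} - \mathcal{M}_{\theta'}\|_1 . \qquad (3)$$

Hence it suffices to bound $\|\mathcal{M}_{\theta^*} - \mathcal{M}_{\theta'}\|_1$.

Note that for $i \in B$, the corresponding $i$-th component has at least half of its probability mass outside
$[-1, 1]$. Since $g$ is zero outside $[-1, 1]$, this mass of the $i$-th component must therefore contribute to the error
$\|g - \mathcal{M}_{\theta^*}\|_1$. Let $\mathbb{1}[x \notin [-1, 1]]$ be the indicator function of the set $\mathbb{R} \setminus [-1, 1]$. Then we get

$$\|g - \mathcal{M}_{\theta^*}\|_1 \ge \|\mathcal{M}_{\theta^*} \cdot \mathbb{1}[x \notin [-1, 1]]\|_1 \ge \frac{1}{2} \left\| \sum_{i \in B} w^*_i \cdot \mathcal{N}_{\mu^*_i,\tau^*_i} \right\|_1 .$$

For $i \in A$, the mixture components of $\mathcal{M}_{\theta^*}$ and $\mathcal{M}_{\theta'}$ match. Hence we have

$$\begin{aligned}
\|\mathcal{M}_{\theta^*} - \mathcal{M}_{\theta'}\|_1 &= \left\| \sum_{i \in B} w^*_i \cdot \mathcal{N}_{\mu^*_i,\tau^*_i} - \sum_{i \in B} w'_i \cdot \mathcal{N}_{\mu'_i,\tau'_i} \right\|_1 \\
&\le \left\| \sum_{i \in B} w^*_i \cdot \mathcal{N}_{\mu^*_i,\tau^*_i} \right\|_1 + \left\| \sum_{i \in B} w'_i \cdot \mathcal{N}_{\mu'_i,\tau'_i} \right\|_1 \\
&= 2 \cdot \left\| \sum_{i \in B} w^*_i \cdot \mathcal{N}_{\mu^*_i,\tau^*_i} \right\|_1 \\
&\le 4 \cdot \|g - \mathcal{M}_{\theta^*}\|_1 .
\end{aligned}$$

Combining this inequality with (3) gives the desired result. □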
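The construction in this proof (move every mean outside $[-1, 1]$ to 0, keep weights and precisions) is easy to try numerically. The sketch below does so for a hypothetical target $g$ (the uniform density on $[-1, 1]$) and a hypothetical 2-GMM with one mean outside the interval, estimating $L_1$ distances with a midpoint rule. All names, parameters, and the integration grid are assumptions made for the example.

```python
import math

def gmm_pdf(x, w, mu, tau):
    """Mixture pdf with precisions tau (one over the standard deviation)."""
    return sum(wi * ti / math.sqrt(2.0 * math.pi) * math.exp(-0.5 * (ti * (x - mi)) ** 2)
               for wi, mi, ti in zip(w, mu, tau))

def l1_distance(f, g, lo=-8.0, hi=8.0, n=16000):
    """Midpoint-rule estimate of the L1 distance between f and g on [lo, hi]."""
    dx = (hi - lo) / n
    return sum(abs(f(lo + (i + 0.5) * dx) - g(lo + (i + 0.5) * dx)) for i in range(n)) * dx

def restrict_means(w, mu, tau):
    """Replace every mean outside [-1, 1] by 0; weights and precisions are kept."""
    return list(w), [m if -1.0 <= m <= 1.0 else 0.0 for m in mu], list(tau)

g = lambda x: 0.5 if -1.0 <= x <= 1.0 else 0.0     # uniform density on [-1, 1]
theta_star = ([0.6, 0.4], [0.2, 2.5], [2.0, 3.0])  # second mean lies outside [-1, 1]
theta_prime = restrict_means(*theta_star)
before = l1_distance(g, lambda x: gmm_pdf(x, *theta_star))
after = l1_distance(g, lambda x: gmm_pdf(x, *theta_prime))
```

In this example the component centered at 2.5 already contributes most of its weight to the error `before`, so moving it inside the interval cannot make the fit more than a constant factor worse.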


We now prove our main theorem for the well-behaved case. 

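Theorem 18. Let $\delta, \varepsilon, \gamma > 0$, $k \ge 1$, and let $f$ be the pdf of the unknown distribution. Moreover, assume
that the density estimate $p'_{\text{dens}}$ obtained in Line 3 of Algorithm 1 is $\gamma$-well-behaved. Then the algorithm
LEARN-WELL-BEHAVED-GMM($k$, $\varepsilon$, $\delta$, $\gamma$) returns a set of GMM parameters $\theta'$ such that

$$\|\mathcal{M}_{\theta'} - f\|_1 \le 60 \cdot \mathrm{OPT}_k + \varepsilon$$

with probability $1 - \delta$. Moreover, the algorithm runs in time

$$\left( k \cdot \log\frac{1}{\varepsilon} \right)^{O(k^4)} \cdot \log\frac{1}{\varepsilon} \cdot \log\log\frac{k\gamma}{\varepsilon} + \tilde{O}\left(\frac{k}{\varepsilon^2}\right) .$$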



Proof. First, we prove the claimed running time. From Fact 2, we know that the density estimation step has
a time complexity of $\tilde{O}(\frac{k}{\varepsilon^2})$. Next, consider the second stage where we fit shape-restricted polynomials to the
density estimate. Note that for $\nu = 3$, the system of polynomial inequalities $S_{K,p_{\text{dens}},\mathcal{P}_\varepsilon,\Theta_{k,\gamma}}(\nu)$ is trivially satisfiable
because the $\mathcal{A}_K$-norm is bounded by the $L_1$-norm and the $L_1$-norm between the two (approximate) densities
is at most $2 + O(\varepsilon)$. Hence the while-loop in the algorithm takes at most $O(\log\frac{1}{\varepsilon})$ iterations. Combining this
bound with the size of the system of polynomial inequalities (see Subsection 3.4.2) and the time complexity
of Renegar’s algorithm (see Fact 3), we get the following running time for solving all systems of polynomial
inequalities proposed by our algorithm:

$$\left( k \cdot \log\frac{1}{\varepsilon} \right)^{O(k^4)} \cdot \log\log\frac{k\gamma}{\varepsilon} \cdot \log\frac{1}{\varepsilon} .$$


This proves the stated running time. 

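Next, we consider the correctness guarantee. We condition on the event that the density estimation stage
succeeds, which occurs with probability $1 - \delta$ (Fact 2). Then we have

$$\|f - p'_{\text{dens}}\|_1 \le 4 \cdot \mathrm{OPT}_k + \varepsilon .$$

Moreover, we can assume that the rescaled density estimate $p_{\text{dens}}$ is $\gamma$-well-behaved. Recalling Definition 11,
this means that there is a set of GMM parameters $\theta \in \Theta_k$ such that $\tau_i \le \gamma$ for all $i \in [k]$ and

$$\begin{aligned}
\|p_{\text{dens}} - \mathcal{M}_\theta\|_1 &= \min_{\theta^* \in \Theta_k} \|p_{\text{dens}} - \mathcal{M}_{\theta^*}\|_1 \\
&= \min_{\theta^* \in \Theta_k} \|p'_{\text{dens}} - \mathcal{M}_{\theta^*}\|_1 \\
&\le \min_{\theta^* \in \Theta_k} \left( \|p'_{\text{dens}} - f\|_1 + \|f - \mathcal{M}_{\theta^*}\|_1 \right) \\
&\le 4 \cdot \mathrm{OPT}_k + \varepsilon + \min_{\theta^* \in \Theta_k} \|f - \mathcal{M}_{\theta^*}\|_1 \\
&\le 5 \cdot \mathrm{OPT}_k + \varepsilon .
\end{aligned}$$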
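Applying the triangle inequality again, this implies that

$$\|p_{\text{dens}} - P_{\varepsilon,\theta}\|_1 \le \|p_{\text{dens}} - \mathcal{M}_\theta\|_1 + \|\mathcal{M}_\theta - P_{\varepsilon,\theta}\|_1 \le 5 \cdot \mathrm{OPT}_k + 2\varepsilon .$$

This almost implies that $S_{K,p_{\text{dens}},\mathcal{P}_\varepsilon,\Theta_{k,\gamma}}(\nu)$ is feasible for $\nu \ge 5 \cdot \mathrm{OPT}_k + 2\varepsilon$. However, there are two remaining
steps. First, recall that the system of polynomial inequalities restricts the means to lie in $[-1, 1]$. Hence we
use Lemma 17, which implies that there is a $\theta^\dagger \in \Theta_k$ such that $\mu^\dagger_i \in [-1, 1]$ and

$$\|p_{\text{dens}} - P_{\varepsilon,\theta^\dagger}\|_1 \le 25 \cdot \mathrm{OPT}_k + 10\varepsilon .$$

Moreover, the system of polynomial inequalities works with the $\mathcal{A}_K$-norm instead of the $L_1$-norm. Using
Lemma 12, we get that

$$\|p_{\text{dens}} - P_{\varepsilon,\theta^\dagger}\|_{\mathcal{A}_K} \le \|p_{\text{dens}} - P_{\varepsilon,\theta^\dagger}\|_1 .$$

Therefore, in some iteration when

$$\nu \le 2 \cdot (25 \cdot \mathrm{OPT}_k + 10\varepsilon) = 50 \cdot \mathrm{OPT}_k + 20\varepsilon$$

the system of polynomial inequalities $S_{K,p_{\text{dens}},\mathcal{P}_\varepsilon,\Theta_{k,\gamma}}(\nu)$ becomes feasible and Renegar’s algorithm guarantees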

that we find parameters 9' such that ||0' — 0^||2 < y for some 6 *^ € Qk and 

$$\|p_{\text{dens}} - \mathcal{M}_{\theta^\dagger}\|_{\mathcal{A}_K} \le 50 \cdot \mathrm{OPT}_k + O(\varepsilon) .$$

Note that we used well-behavedness here to ensure that the precisions in $\theta^\dagger$ are bounded by $\gamma$. Let $\theta$ be the
parameters we return. It is not difficult to see that || 6 * — 9^\2 < We convert this back to an Li guarantee 
via Lemma 12: 

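$$\|p_{\text{dens}} - \mathcal{M}_{\theta^\dagger}\|_1 \le 56 \cdot \mathrm{OPT}_k + O(\varepsilon) .$$

Next, we use parameter stability (Lemma 15) and get

$$\|p_{\text{dens}} - \mathcal{M}_\theta\|_1 \le 56 \cdot \mathrm{OPT}_k + O(\varepsilon) .$$

We now relate this back to the unknown density $f$. Let $\theta'$ be the parameters $\theta$ scaled back to the original
density estimate (see Lines 21 to 23 in Algorithm 1). Then we have

$$\|p'_{\text{dens}} - \mathcal{M}_{\theta'}\|_1 \le 56 \cdot \mathrm{OPT}_k + O(\varepsilon) .$$

Using the fact that $p'_{\text{dens}}$ is a good density estimate, we get

$$\begin{aligned}
\|f - \mathcal{M}_{\theta'}\|_1 &\le \|f - p'_{\text{dens}}\|_1 + \|p'_{\text{dens}} - \mathcal{M}_{\theta'}\|_1 \\
&\le 4 \cdot \mathrm{OPT}_k + \varepsilon + 56 \cdot \mathrm{OPT}_k + O(\varepsilon) \\
&\le 60 \cdot \mathrm{OPT}_k + O(\varepsilon) .
\end{aligned}$$

As a final step, we choose an internal $\varepsilon'$ in our algorithm so that the $O(\varepsilon')$ in the above guarantee becomes
bounded by $\varepsilon$. This proves the desired approximation guarantee. □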


4 General algorithm 

4.1 Preliminaries 

As before, we let $p_{\mathrm{dens}}$ be the piecewise polynomial returned by Learn-Piecewise-Polynomial (see Fact 2). Let $I_0, \ldots, I_{s+1}$ be the intervals defined by the breakpoints of $p_{\mathrm{dens}}$. Recall that $p_{\mathrm{dens}}$ has degree $O(\log 1/\epsilon)$ and has $s + 2 = O(k)$ pieces. Furthermore, $I_0$ and $I_{s+1}$ are unbounded in length, and on these intervals $p_{\mathrm{dens}}$ is zero. By rescaling and translating, we may assume WLOG that $\bigcup_{i=1}^{s} I_i$ is $[-1, 1]$.

Recall that $\mathcal{I}$ is defined by the set of intervals $\{I_1, \ldots, I_s\}$. We know that $s = O(k)$. Intuitively, these intervals capture the different scales at which we need to operate. We formalize this intuition below.

Definition 19. For any Gaussian $\mathcal{N}_{\mu,\tau}$, let $L(\mathcal{N}_{\mu,\tau})$ be the interval centered at $\mu$ on which $\mathcal{N}_{\mu,\tau}$ places exactly $W$ of its weight, where $0 < W < 1$ is a universal constant we will determine later. By properties of Gaussians, there is some absolute constant $\omega > 0$ such that $\mathcal{N}_{\mu,\tau}(x) \geq \omega \tau$ for all $x \in L(\mathcal{N}_{\mu,\tau})$.
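As a rough numerical illustration of Definition 19, the sketch below computes the interval $L(\mathcal{N}_{\mu,\tau})$ and one valid choice of $\omega$. It assumes that the precision is $\tau = 1/\sigma$ (the reciprocal of the standard deviation), which matches the scaling $\mathcal{N}_{\mu,\tau}(x) \geq \omega\tau$ above; the helper names and the sample values of $\mu$ and $\tau$ are hypothetical.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal, used for quantiles and densities


def half_width_in_sigmas(W):
    """Number of standard deviations z with P(|Z| <= z) = W."""
    return STD.inv_cdf((1.0 + W) / 2.0)


def weight_interval(mu, tau, W):
    """Interval centered at mu on which N(mu, 1/tau^2) places mass exactly W."""
    z = half_width_in_sigmas(W)
    return mu - z / tau, mu + z / tau


def omega(W):
    """A constant with density >= omega * tau on the whole interval.

    The density is smallest at the endpoints, where it equals tau * pdf(z),
    so pdf(z) is the largest valid constant (assumption: tau = 1 / sigma).
    """
    return STD.pdf(half_width_in_sigmas(W))


W = 55 / 56          # the value of W fixed later in the proof of Lemma 21
mu, tau = 0.3, 4.0   # arbitrary example parameters
left, right = weight_interval(mu, tau, W)
gauss = NormalDist(mu=mu, sigma=1.0 / tau)
mass = gauss.cdf(right) - gauss.cdf(left)
```

Since $\omega$ depends only on $W$, it is indeed an absolute constant once $W$ is fixed.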

Definition 20. Say a Gaussian $\mathcal{N}_{\mu,\tau}$ is admissible if (i) $\mathcal{N}_{\mu,\tau}$ places at least $1/2$ of its mass in $[-1, 1]$, and (ii) there is a $J \in \mathcal{I}$ so that $|J \cap L(\mathcal{N}_{\mu,\tau})| \geq 1/(8 s \tau)$ and so that


where 

(j> = <j>{e, k) ^^m(m + 1)^ • {V2 + 1)™ , 

UJC 

where $m$ is the degree of $p_{\mathrm{dens}}$. We call the interval $J \in \mathcal{I}$ satisfying this property on which $\mathcal{N}_{\mu,\tau}$ places most of its mass its associated interval.

Fix $\theta \in \Theta_k$. We say the $\ell$-th component is admissible if the underlying Gaussian is admissible and moreover $w_\ell \geq \epsilon/k$.

Notice that since $m = O(\log(1/\epsilon))$, we have that $\phi(\epsilon, k) = \mathrm{poly}(1/\epsilon, k)$.

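Lemma 21 (No Interaction Lemma). Fix $\theta \in \Theta_k$. Let $S_{\mathrm{good}}(\theta) \subseteq [k]$ be the set of $\ell \in [k]$ whose corresponding mixture component is admissible, and let $S_{\mathrm{bad}}(\theta)$ be the rest. Then, we have

$$\|\mathcal{M}_\theta - p_{\mathrm{dens}}\|_1 \geq \Big\| \sum_{\ell \in S_{\mathrm{good}}(\theta)} w_\ell \cdot \mathcal{N}_{\mu_\ell,\tau_\ell} - p_{\mathrm{dens}} \Big\|_1 + \frac{1}{2} \sum_{\ell \in S_{\mathrm{bad}}(\theta)} w_\ell - 2\epsilon .$$

We briefly remark that the constant $\frac{1}{2}$ we obtain here is somewhat arbitrary; by choosing different universal constants above, one can obtain any fraction arbitrarily close to one, at a minimal loss.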

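Proof. Fix $\ell \in S_{\mathrm{bad}}(\theta)$, and denote the corresponding component $\mathcal{N}_\ell$. Recall that it has mean $\mu_\ell$ and precision $\tau_\ell$. Let $L_\ell = L(\mathcal{N}_\ell)$.

Let $\mathcal{M}_\theta^{-\ell}(x) = \sum_{i \neq \ell} w_i \mathcal{N}_{\mu_i,\tau_i}(x)$ be the density of the mixture without the $\ell$-th component. We will show that

$$\|\mathcal{M}_\theta - p_{\mathrm{dens}}\|_1 \geq \|\mathcal{M}_\theta^{-\ell} - p_{\mathrm{dens}}\|_1 + \frac{1}{2} w_\ell - \frac{2\epsilon}{k} .$$

It suffices to prove this inequality because then we may repeat the argument with a different $\ell' \in S_{\mathrm{bad}}(\theta)$ until we have subtracted out all such $\ell$, and this yields the claim in the lemma.

If $w_\ell \leq \epsilon/k$ then this statement is obvious. If $\mathcal{N}_\ell$ places less than half its weight on $[-1, 1]$, then this is also obvious. Thus we will assume that $w_\ell > \epsilon/k$ and $\mathcal{N}_\ell$ places at least half its weight on $[-1, 1]$.

Let $\mathcal{I}_\ell$ be the set of intervals in $\mathcal{I}$ which intersect $L_\ell$. We partition the intervals in $\mathcal{I}_\ell$ into two groups:

1. Let $\mathcal{L}_1$ be the set of intervals $J \in \mathcal{I}_\ell$ so that $|J \cap L_\ell| \leq 1/(8 s \tau_\ell)$.

2. Let $\mathcal{L}_2$ be the set of intervals $J \in \mathcal{I}_\ell$ not in $\mathcal{L}_1$ so that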



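By the definition of admissibility, this is indeed a partition of $\mathcal{I}_\ell$.

We have

$$\begin{aligned}
\|\mathcal{M}_\theta - p_{\mathrm{dens}}\|_1
&= \int_{L_\ell} |\mathcal{M}_\theta^{-\ell}(x) + w_\ell \mathcal{N}_\ell(x) - p_{\mathrm{dens}}(x)| \, dx + \int_{L_\ell^c} |\mathcal{M}_\theta^{-\ell}(x) + w_\ell \mathcal{N}_\ell(x) - p_{\mathrm{dens}}(x)| \, dx \\
&\geq \int_{L_\ell} |\mathcal{M}_\theta^{-\ell}(x) + w_\ell \mathcal{N}_\ell(x) - p_{\mathrm{dens}}(x)| \, dx + \int_{L_\ell^c} |\mathcal{M}_\theta^{-\ell}(x) - p_{\mathrm{dens}}(x)| \, dx - w_\ell \int_{L_\ell^c} \mathcal{N}_\ell(x) \, dx \\
&\geq \int_{L_\ell} |\mathcal{M}_\theta^{-\ell}(x) + w_\ell \mathcal{N}_\ell(x) - p_{\mathrm{dens}}(x)| \, dx + \int_{L_\ell^c} |\mathcal{M}_\theta^{-\ell}(x) - p_{\mathrm{dens}}(x)| \, dx - (1 - W) w_\ell .
\end{aligned}$$

We split the first term on the RHS into two parts, given by our partition:

$$\begin{aligned}
\int_{L_\ell} |\mathcal{M}_\theta^{-\ell}(x) + w_\ell \mathcal{N}_\ell(x) - p_{\mathrm{dens}}(x)| \, dx
&= \sum_{J \in \mathcal{L}_1} \int_{J \cap L_\ell} |\mathcal{M}_\theta^{-\ell}(x) + w_\ell \mathcal{N}_\ell(x) - p_{\mathrm{dens}}(x)| \, dx \\
&\quad + \sum_{J \in \mathcal{L}_2} \int_{J \cap L_\ell} |\mathcal{M}_\theta^{-\ell}(x) + w_\ell \mathcal{N}_\ell(x) - p_{\mathrm{dens}}(x)| \, dx .
\end{aligned}$$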

We lower bound the contribution of each term separately. 

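(1) We first bound the first term. Since for each $J \in \mathcal{L}_1$ we have $|J \cap L_\ell| \leq 1/(8 s \tau_\ell)$, and since $\mathcal{N}_\ell(x) \leq \tau_\ell$ everywhere, we know that

$$\int_{J \cap L_\ell} \mathcal{N}_\ell(x) \, dx \leq \frac{1}{8s} , \tag{4}$$

and so

$$\begin{aligned}
\sum_{J \in \mathcal{L}_1} \int_{J \cap L_\ell} |\mathcal{M}_\theta^{-\ell}(x) + w_\ell \mathcal{N}_\ell(x) - p_{\mathrm{dens}}(x)| \, dx
&\geq \sum_{J \in \mathcal{L}_1} \int_{J \cap L_\ell} |\mathcal{M}_\theta^{-\ell}(x) - p_{\mathrm{dens}}(x)| \, dx - |\mathcal{L}_1| \cdot \frac{w_\ell}{8s} \\
&\geq \sum_{J \in \mathcal{L}_1} \int_{J \cap L_\ell} |\mathcal{M}_\theta^{-\ell}(x) - p_{\mathrm{dens}}(x)| \, dx - \frac{1}{8} w_\ell ,
\end{aligned}$$

since $\mathcal{I}$ and thus $\mathcal{L}_1$ contains at most $s$ intervals.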

(2) We now consider the second term. Fix a $J \in \mathcal{L}_2$, and let $p_J$ be the polynomial which is equal to $p_{\mathrm{dens}}$ on $J$. Since $\int p_{\mathrm{dens}} \leq 1 + \epsilon \leq 2$ (as otherwise its $L_1$-distance to the unknown density would be more than $\epsilon$) and $p_{\mathrm{dens}}$ is nonnegative, we also know that $\int_J p_J \leq 2$. We require the following fact (see [ADLS15]):

Fact 22. Let $p(x) = \sum_{i=0}^{m} c_i x^i$ be a degree-$m$ polynomial so that $p \geq 0$ on $[-1, 1]$ and $\int_{-1}^{1} p \leq \beta$. Then

$$\max_i |c_i| \leq \beta \cdot (m+1)^2 \cdot (\sqrt{2} + 1)^m .$$

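Consider the shifted polynomial $q_J(u) = p_J(u \cdot (b_J - a_J)/2 + (b_J + a_J)/2)$ where $J = [a_J, b_J]$. By applying Fact 22 to $q_J$ and noting that $\int_{-1}^{1} q_J = (2/|J|) \cdot \int_J p_J$, we conclude that the coefficients of $q_J$ are bounded by

$$\frac{4}{|J|} \cdot (m+1)^2 \cdot (\sqrt{2} + 1)^m$$

and thus

$$|q_J(u)| \leq \frac{4}{|J|} \cdot m (m+1)^2 \cdot (\sqrt{2} + 1)^m$$

for all $u \in [-1, 1]$, and so therefore the same bound applies for $p_J(x)$ for all $x \in J$.

But notice that since we assume that $J \in \mathcal{L}_2$, it follows that for all $x \in J \cap L_\ell$, we have that

$$\mathcal{N}_\ell(x) \geq \frac{8k}{\epsilon} \, p_J(x) ,$$

and so in particular $w_\ell \mathcal{N}_\ell(x) \geq 8 p_J(x)$ for all $x \in J \cap L_\ell$. Hence we have

$$\begin{aligned}
\int_{J \cap L_\ell} |\mathcal{M}_\theta^{-\ell}(x) + w_\ell \mathcal{N}_\ell(x) - p_{\mathrm{dens}}(x)| \, dx
&= \int_{J \cap L_\ell} \big( \mathcal{M}_\theta^{-\ell}(x) + w_\ell \mathcal{N}_\ell(x) - p_J(x) \big) \, dx \\
&\geq \int_{J \cap L_\ell} |\mathcal{M}_\theta^{-\ell}(x) - p_J(x)| \, dx + \int_{J \cap L_\ell} \Big( \frac{7}{8} w_\ell \mathcal{N}_\ell(x) - p_J(x) \Big) \, dx \\
&\geq \int_{J \cap L_\ell} |\mathcal{M}_\theta^{-\ell}(x) - p_J(x)| \, dx + \frac{3}{4} w_\ell \int_{J \cap L_\ell} \mathcal{N}_\ell(x) \, dx ,
\end{aligned}$$

where the second line follows since $\mathcal{M}_\theta^{-\ell}(x) + w_\ell \mathcal{N}_\ell(x) - p_J(x) \geq |\mathcal{M}_\theta^{-\ell}(x) - p_J(x)| + \frac{7}{8} w_\ell \mathcal{N}_\ell(x) - p_J(x)$ for all $x \in J \cap L_\ell$.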
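Thus

$$\sum_{J \in \mathcal{L}_2} \int_{J \cap L_\ell} |\mathcal{M}_\theta^{-\ell}(x) + w_\ell \mathcal{N}_\ell(x) - p_{\mathrm{dens}}(x)| \, dx \geq \sum_{J \in \mathcal{L}_2} \Big( \int_{J \cap L_\ell} |\mathcal{M}_\theta^{-\ell}(x) - p_{\mathrm{dens}}(x)| \, dx + \frac{3}{4} w_\ell \int_{J \cap L_\ell} \mathcal{N}_\ell(x) \, dx \Big) . \tag{5}$$

Moreover, by Equation (4), we know that

$$\sum_{J \in \mathcal{L}_2} \int_{J \cap L_\ell} \mathcal{N}_\ell(x) \, dx = \int_{L_\ell} \mathcal{N}_\ell(x) \, dx - \sum_{J \in \mathcal{L}_1} \int_{J \cap L_\ell} \mathcal{N}_\ell(x) \, dx \geq W - \frac{1}{8} ,$$

since $\mathcal{L}_1$ contains at most $s$ intervals. Thus, the RHS of Equation (5) must be lower bounded by

$$\sum_{J \in \mathcal{L}_2} \int_{J \cap L_\ell} |\mathcal{M}_\theta^{-\ell}(x) - p_{\mathrm{dens}}(x)| \, dx + \frac{3}{4} \Big( W - \frac{1}{8} \Big) w_\ell .$$

Putting it all together. Hence, we have

$$\begin{aligned}
\int_{L_\ell} |\mathcal{M}_\theta(x) - p_{\mathrm{dens}}(x)| \, dx
&= \sum_{J \in \mathcal{L}_1} \int_{J \cap L_\ell} |\mathcal{M}_\theta(x) - p_{\mathrm{dens}}(x)| \, dx + \sum_{J \in \mathcal{L}_2} \int_{J \cap L_\ell} |\mathcal{M}_\theta(x) - p_{\mathrm{dens}}(x)| \, dx \\
&\geq \sum_{J \in \mathcal{L}_1} \int_{J \cap L_\ell} |\mathcal{M}_\theta^{-\ell}(x) - p_{\mathrm{dens}}(x)| \, dx + \sum_{J \in \mathcal{L}_2} \int_{J \cap L_\ell} |\mathcal{M}_\theta^{-\ell}(x) - p_{\mathrm{dens}}(x)| \, dx + \Big( \frac{3}{4} \Big( W - \frac{1}{8} \Big) - \frac{1}{8} \Big) w_\ell \\
&\geq \int_{L_\ell} |\mathcal{M}_\theta^{-\ell}(x) - p_{\mathrm{dens}}(x)| \, dx + \Big( \frac{3}{4} \Big( W - \frac{1}{8} \Big) - \frac{1}{8} \Big) w_\ell .
\end{aligned}$$

We therefore have

$$\begin{aligned}
\|\mathcal{M}_\theta - p_{\mathrm{dens}}\|_1
&= \int_{L_\ell} |\mathcal{M}_\theta(x) - p_{\mathrm{dens}}(x)| \, dx + \int_{L_\ell^c} |\mathcal{M}_\theta(x) - p_{\mathrm{dens}}(x)| \, dx \\
&\geq \int_{L_\ell} |\mathcal{M}_\theta(x) - p_{\mathrm{dens}}(x)| \, dx + \int_{L_\ell^c} |\mathcal{M}_\theta^{-\ell}(x) - p_{\mathrm{dens}}(x)| \, dx - \int_{L_\ell^c} w_\ell \mathcal{N}_\ell(x) \, dx \\
&\geq \int_{L_\ell} |\mathcal{M}_\theta(x) - p_{\mathrm{dens}}(x)| \, dx + \int_{L_\ell^c} |\mathcal{M}_\theta^{-\ell}(x) - p_{\mathrm{dens}}(x)| \, dx - (1 - W) w_\ell \\
&\geq \int_{L_\ell} |\mathcal{M}_\theta^{-\ell}(x) - p_{\mathrm{dens}}(x)| \, dx + \Big( \frac{7}{4} W - \frac{39}{32} \Big) w_\ell + \int_{L_\ell^c} |\mathcal{M}_\theta^{-\ell}(x) - p_{\mathrm{dens}}(x)| \, dx \\
&= \|\mathcal{M}_\theta^{-\ell} - p_{\mathrm{dens}}\|_1 + \frac{1}{2} w_\ell ,
\end{aligned}$$

when we set $W = 55/56$. □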
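The choice $W = 55/56$ can be sanity-checked with exact arithmetic. The sketch below assumes the three contributions to the coefficient of $w_\ell$ are a gain of $\frac{3}{4}(W - \frac{1}{8})$ from the intervals in the second group, a loss of $\frac{1}{8}$ from the intervals in the first group, and a loss of $1 - W$ from the mass outside $L_\ell$. These constants are a reading of the proof above rather than a quotation, and the variable names are hypothetical.

```python
from fractions import Fraction

W = Fraction(55, 56)

# Assumed bookkeeping of the proof (see the lead-in for where each term comes from).
gain_second_group = Fraction(3, 4) * (W - Fraction(1, 8))
loss_first_group = Fraction(1, 8)
loss_outside_interval = 1 - W

net_coefficient = gain_second_group - loss_first_group - loss_outside_interval
```

Under these assumptions the net coefficient is exactly $\frac{1}{2}$, which is the constant appearing in Lemma 21.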


4.2 A parametrization scheme for a single Gaussian 

Intuitively, Lemma 21 says that for any $\theta \in \Theta_k$, there are some components which have bounded variance and which can be close to $p_{\mathrm{dens}}$ (the components in $S_{\mathrm{good}}$), and the remaining components, which may have unbounded variance but which will be far away from $p_{\mathrm{dens}}$. Since we are searching for a $k$-GMM which is close to $p_{\mathrm{dens}}$, in some sense we should not have to concern ourselves with the latter components since they cannot meaningfully interact with $p_{\mathrm{dens}}$. Thus we only need to find a suitably robust parametrization for admissible Gaussians.

Such a parametrization can be obtained by linearly transforming the domain so that the associated interval gets mapped to $[-1, 1]$. Formally, fix a Gaussian $\mathcal{N}_{\mu,\tau}$ and an interval $J$. Then it can be written as


{x') 



x — mid( J) 
| T |/2 



( 6 ) 


for some unique $\tilde{\mu}$ and $\tilde{\tau}$, where for any interval $I$, we define $\mathrm{mid}(I)$ to denote its midpoint. Call these the rescaled mean with respect to $J$ and rescaled precision with respect to $J$ of $\mathcal{N}$, respectively. Concretely, given $\mu$, $\tau$, and an interval $J$, the rescaled precision and mean with respect to $J$ are defined to be


r = 



M = 


|J|/2 


(/i — mid(J)) . 


For any $\tilde{\mu}, \tilde{\tau}$, we let $\mathcal{N}^{r,J}_{\tilde{\mu},\tilde{\tau}}(x)$ denote the function given by the RHS of Equation (6). The following two lemmas say that these rescaled parameters have the desired robustness properties.
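To make the rescaling concrete, here is a small sketch of one convention that maps the interval $J$ to $[-1, 1]$ and reproduces the original density exactly. It assumes the density convention $\mathcal{N}_{\mu,\tau}(x) = \tau \cdot \mathcal{N}(\tau (x - \mu))$ with $\mathcal{N}$ the standard normal pdf, a rescaled precision of $\tau |J| / 2$ (which agrees with the bounds used in the proof of Lemma 23), and a rescaled mean of $(\mu - \mathrm{mid}(J)) / (|J|/2)$. The normalization of the rescaled mean in particular is an assumption and may differ from the one in Equation (6); all function names are hypothetical.

```python
import math


def std_normal_pdf(t):
    """Density of the standard normal distribution."""
    return math.exp(-t * t / 2.0) / math.sqrt(2.0 * math.pi)


def gaussian(x, mu, tau):
    """Gaussian with mean mu and precision tau = 1 / sigma (assumed convention)."""
    return tau * std_normal_pdf(tau * (x - mu))


def rescale(mu, tau, J):
    """Return (rescaled mean, rescaled precision) with respect to J = (a, b)."""
    a, b = J
    half, mid = (b - a) / 2.0, (a + b) / 2.0
    return (mu - mid) / half, tau * half


def rescaled_gaussian(x, mu_r, tau_r, J):
    """Evaluate the Gaussian from its rescaled parameters with respect to J."""
    a, b = J
    half, mid = (b - a) / 2.0, (a + b) / 2.0
    x_r = (x - mid) / half  # J is mapped linearly onto [-1, 1]
    return (tau_r / half) * std_normal_pdf(tau_r * (x_r - mu_r))


J = (0.2, 0.6)
mu, tau = 0.35, 12.0
mu_r, tau_r = rescale(mu, tau, J)
```

With this convention the map between original and rescaled parameters is linear once $J$ is fixed, so it is cheap to apply and to invert.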

Lemma 23. Let Afji’f be an admissible Gaussian with rescaled mean p and rescaled precision f with respect 
to its associated interval J G X. Then p G and \fLK ■ w/(16s) < f < (j)/2. 

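Proof. We first show that $\sqrt{2\pi} \cdot \omega/(16s) \leq \tilde{\tau} \leq \phi/2$. That the rescaled precision is bounded from above follows from a simple change of variables and the definition of admissibility. By the definition of admissibility, we also know that

$$\int_J \mathcal{N}^{r,J}_{\tilde{\mu},\tilde{\tau}} \, dx \geq \int_{J \cap L(\mathcal{N}^{r,J}_{\tilde{\mu},\tilde{\tau}})} \mathcal{N}^{r,J}_{\tilde{\mu},\tilde{\tau}} \, dx \geq \omega \tau \cdot |J \cap L(\mathcal{N}^{r,J}_{\tilde{\mu},\tilde{\tau}})| \geq \frac{\omega}{8s} .$$

Furthermore, we trivially have

$$|J| \cdot \frac{\tau}{\sqrt{2\pi}} \geq \int_J \mathcal{N}^{r,J}_{\tilde{\mu},\tilde{\tau}} \, dx .$$

Thus, the precision $\tau$ must be at least $\sqrt{2\pi} \, \omega/(8 s |J|)$, and so its rescaled precision must be at least $\sqrt{2\pi} \, \omega/(16s)$, as claimed.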

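We now bound the rescaled mean $\tilde{\mu}$. Because $\mathcal{N}^{r,J}_{\tilde{\mu},\tilde{\tau}}$ is an admissible Gaussian with associated interval $J$, we know that $|J \cap L(\mathcal{N}^{r,J}_{\tilde{\mu},\tilde{\tau}})| \geq 1/(8 s \tau)$. Moreover, we know that on $J \cap L(\mathcal{N}^{r,J}_{\tilde{\mu},\tilde{\tau}})$ we have $\mathcal{N}^{r,J}_{\tilde{\mu},\tilde{\tau}}(x) \geq \omega \tau$. Thus in particular

$$\int_J \mathcal{N}^{r,J}_{\tilde{\mu},\tilde{\tau}} \, dx \geq \int_{J \cap L(\mathcal{N}^{r,J}_{\tilde{\mu},\tilde{\tau}})} \mathcal{N}^{r,J}_{\tilde{\mu},\tilde{\tau}} \, dx \geq \frac{\omega}{8s} .$$

Define $J'$ to be the interval which is of length $8 s |J| / \omega$ around $\mathrm{mid}(J)$. We claim that $\mu \in J'$, where $\mu$ is the mean of $\mathcal{N}^{r,J}_{\tilde{\mu},\tilde{\tau}}$.

Assume that $\mathrm{mid}(J) \leq \mu$. Let $J_0 = J$ and inductively, for $i < 4s/\omega$, let $J_i$ be the interval with left endpoint at the right endpoint of $J_{i-1}$ and with length $|J|$. That is, the $J_i$ consist of $4s/\omega$ consecutive, non-intersecting copies of $J$ starting at $J$ and going upwards on the number line (for simplicity of exposition we assume that $4s/\omega$ is an integer). Let $J^\dagger = \bigcup_{i=0}^{(4s/\omega) - 1} J_i$. We claim that $\mu \in J^\dagger$. Suppose not. This means that $\mu$ is strictly greater than any point in any $J_i$. In particular, this implies that for all $i$,

$$\int_{J_i} \mathcal{N}^{r,J}_{\tilde{\mu},\tilde{\tau}} \, dx \geq \int_{J_{i-1}} \mathcal{N}^{r,J}_{\tilde{\mu},\tilde{\tau}} \, dx \geq \cdots \geq \int_{J_0} \mathcal{N}^{r,J}_{\tilde{\mu},\tilde{\tau}} \, dx \geq \frac{\omega}{8s} .$$

But then this would imply that

$$\int_{J^\dagger} \mathcal{N}^{r,J}_{\tilde{\mu},\tilde{\tau}} \, dx = \sum_{i=0}^{(4s/\omega) - 1} \int_{J_i} \mathcal{N}^{r,J}_{\tilde{\mu},\tilde{\tau}} \, dx \geq \frac{1}{2} .$$

Notice that $J^\dagger$ is itself an interval. But any interval containing at least $1/2$ of the weight of any Gaussian must contain its mean, which we assumed did not happen. Thus we conclude that $\mu \in J^\dagger$. Moreover, $J^\dagger \subseteq J'$, so $\mu \in J'$, as claimed. If $\mathrm{mid}(J) > \mu$ then apply the symmetric argument with $J_i$ which are decreasing on the number line instead of increasing.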

We have thus shown that jj, G J. It is a straightforward calculation to show that this implies that 
fl G ^]- By the above, we know that r < (()/2 and thus jl G ^]j as claimed. 


□ 


Lemma 24. For any interval J, and /ii, fi, /i 2 , 72 so that \ fi\ < 2(j) for i G {1, 2} and \fli — fi 2 \ + \ti — T 2 \ < 
0{{e/{(j)k))'^), we have 

Proof. This follows by a change of variables and Lemma 16. □ 

Moreover, this rescaled parametrization naturally lends itself to approximation by a piecewise polynomial, 
namely, replace the standard normal Gaussian density function in Equation (6) with P^. This is the piecewise 
polynomial that we will use to represent each individual component in the Gaussian mixture. 
4.3 A parametrization scheme for k-GMMs


In the rest of this section, our parametrization will often be of the form described above. To distinguish this from the previous notation, for any $\theta \in \Theta_k$ and any set of $k$ intervals $J_1, \ldots, J_k$, we will let $\theta^r \in \Theta_k$ denote the rescaled parameters so that if the $i$-th component in the mixture represented by $\theta$ has parameters $w_i, \mu_i, \tau_i$, then the $i$-th component in the mixture represented by $\theta^r$ has parameters $w_i, \tilde{\mu}_i, \tilde{\tau}_i$, where $\tilde{\mu}_i$ and $\tilde{\tau}_i$ are the rescaled mean and rescaled precision of $\mathcal{N}_{\mu_i,\tau_i}$ with respect to $J_i$.

Notice that the transformation between the original and the rescaled parameters is a linear transformation, and thus trivial to compute and to invert.

The final difficulty is that we do not know how many mixture components have associated interval $J$ for $J \in \mathcal{I}$. To deal with this, our algorithm simply iterates over all possible allocations of the mixture components to intervals and returns the best one. There are $O(k)$ possible associated intervals $J$ and $k$ different components, so there are at most $k^{O(k)}$ different possible allocations. In this section, we will see how our parametrization works when we fix an allocation of the mixture components.

More formally, let $A$ be the set of functions $v : [s] \to \mathbb{N}$ so that $\sum_{\ell=1}^{s} v(\ell) = k$. These will represent the number of components "allocated" to exist on the scale of each $I_\ell$. For any $v \in A$, define $\mathcal{I}_v$ to be the set of $I_\ell \in \mathcal{I}$ so that $v(\ell) \neq 0$.
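The set of allocations can be enumerated directly. The sketch below lists every way to distribute $k$ components over $s$ intervals using the standard stars-and-bars correspondence; the function name `allocations` is hypothetical and the code is only meant to illustrate the size of the search space, not the paper's implementation.

```python
from itertools import combinations
from math import comb


def allocations(s, k):
    """Yield all tuples (v_1, ..., v_s) of nonnegative integers summing to k.

    Stars and bars: choose s - 1 bar positions among k + s - 1 slots; the
    gaps between consecutive bars are the counts v_1, ..., v_s.
    """
    slots = k + s - 1
    for bars in combinations(range(slots), s - 1):
        counts = []
        previous = -1
        for bar in bars:
            counts.append(bar - previous - 1)
            previous = bar
        counts.append(slots - previous - 1)
        yield tuple(counts)


s, k = 4, 3
all_allocations = list(allocations(s, k))
```

The number of allocations is $\binom{k + s - 1}{s - 1}$, which for $s = O(k)$ is at most $k^{O(k)}$, consistent with the count stated above.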

Fix 9'' G 0fc and v G A. Decompose 0” into {91,... ,9^), where 9^ contains the rescaled parameters with 
respect to Ji for the v{£) components allocated to interval (note that v{£) may be 0 in which case 9i is 
the empty set, i.e., corresponds to the parameters for no components). For any 1 < £ < s, let 




Ti 


Wj 


\h\l2 


N 


X — mid(/^) 

\h\l2 


where i ranges over the components that 9j corresponds to, and define y{x) 
define 


P, 


e,e,9] 






P. 


X — mid)/^) 

Vhm 


Similarly, 


and define P^g,- y(x) = Finally, for any v, define to be the set of all such P[g ,, 

We have: 


Lemma 25. For any $\theta^r \in \Theta_k$, we have

$$\|\mathcal{M}_{\theta^r,v} - P_{\epsilon,\theta^r,v}\|_1 \leq \epsilon .$$

This follows from roughly the same argument as in the proof of Lemma 6, and so we omit the proof.

We now finally have all the necessary language and tools to prove the following theorem: 

Corollary 26. Fix 2 > e > 0. There is some alloeation v G A and a set of parameters 9^ G 0fc so that 
fii G ^]> l/(8s) < fi < (j)/2, and W£ > e/(2fc) for all i. Moreover, 

||/-Afg..J|i<19-0PTfc + 0(e). 

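Proof. Let $\theta^* \in \Theta_k$ be so that $\|f - \mathcal{M}_{\theta^*}\|_1 = \mathrm{OPT}_k$, and let $\mathcal{N}^*_\ell$ denote its $\ell$-th component with parameters $w^*_\ell$, $\mu^*_\ell$, and $\tau^*_\ell$. Decompose $[k]$ into $S_{\mathrm{good}}(\theta^*)$, $S_{\mathrm{bad}}(\theta^*)$ as in Lemma 21.

By the guarantees of the density estimation algorithm, we know that

$$\Big\| \sum_{\ell=1}^{k} w^*_\ell \mathcal{N}_{\mu^*_\ell,\tau^*_\ell} - p_{\mathrm{dens}} \Big\|_1 \leq 5 \, \mathrm{OPT}_k + \epsilon .$$

By Lemma 21, this implies that

$$5 \, \mathrm{OPT}_k + \epsilon \geq \Big\| \sum_{\ell \in S_{\mathrm{good}}(\theta^*)} w^*_\ell \mathcal{N}_{\mu^*_\ell,\tau^*_\ell} - p_{\mathrm{dens}} \Big\|_1 + \frac{1}{2} \sum_{\ell \in S_{\mathrm{bad}}(\theta^*)} w^*_\ell - 2\epsilon ,$$

from which we may conclude the following two inequalities:

$$\Big\| \sum_{\ell \in S_{\mathrm{good}}(\theta^*)} w^*_\ell \mathcal{N}_{\mu^*_\ell,\tau^*_\ell} - p_{\mathrm{dens}} \Big\|_1 \leq 5 \cdot \mathrm{OPT}_k + 3\epsilon , \tag{7}$$

$$\sum_{\ell \in S_{\mathrm{bad}}(\theta^*)} w^*_\ell \leq 10 \cdot \mathrm{OPT}_k + 6\epsilon . \tag{8}$$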

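Let $\theta'$ be defined so that for all $\ell \in S_{\mathrm{good}}(\theta^*)$, the mean and precision of the $\ell$-th component in $\theta'$ are $\mu^*_\ell$ and $\tau^*_\ell$, and so that for all $\ell \in S_{\mathrm{bad}}(\theta^*)$, the mean $\mu'_\ell$ and precision $\tau'_\ell$ of the $\ell$-th component in $\theta'$ are arbitrary but so that the underlying Gaussian is admissible. Let the weights of the components in $\theta'$ be the same as the weights in $\theta^*$.

Then we have

$$\begin{aligned}
\|\mathcal{M}_{\theta'} - f\|_1
&= \Big\| \sum_{\ell \in S_{\mathrm{good}}(\theta^*)} w^*_\ell \mathcal{N}_{\mu^*_\ell,\tau^*_\ell} + \sum_{\ell \in S_{\mathrm{bad}}(\theta^*)} w^*_\ell \mathcal{N}_{\mu'_\ell,\tau'_\ell} - f \Big\|_1 \\
&\leq \Big\| \sum_{\ell \in S_{\mathrm{good}}(\theta^*)} w^*_\ell \mathcal{N}_{\mu^*_\ell,\tau^*_\ell} - f \Big\|_1 + \Big\| \sum_{\ell \in S_{\mathrm{bad}}(\theta^*)} w^*_\ell \mathcal{N}_{\mu'_\ell,\tau'_\ell} \Big\|_1 \\
&= \Big\| \sum_{\ell \in S_{\mathrm{good}}(\theta^*)} w^*_\ell \mathcal{N}_{\mu^*_\ell,\tau^*_\ell} - f \Big\|_1 + \sum_{\ell \in S_{\mathrm{bad}}(\theta^*)} w^*_\ell \\
&\leq \Big\| \sum_{\ell \in S_{\mathrm{good}}(\theta^*)} w^*_\ell \mathcal{N}_{\mu^*_\ell,\tau^*_\ell} - p_{\mathrm{dens}} \Big\|_1 + \|f - p_{\mathrm{dens}}\|_1 + \sum_{\ell \in S_{\mathrm{bad}}(\theta^*)} w^*_\ell \\
&\leq 19 \cdot \mathrm{OPT}_k + O(\epsilon) ,
\end{aligned}$$

where the last line follows from Equation (7), the guarantee of the density estimation algorithm, and Equation (8).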

For each £ G [fc], let Ji G I denote the interval so that the Gth component of O' is admissible with 
respect to Ji Let 0’’ be the rescaling of O' with respect to Ji,..., J^. Then by Lemma 23, O'" satisfies that 
Pi G and • uj/(IQs) < n < (/)/2 for all i. Let w € .4 be chosen so that v(i) is the number 

of times that li appears in the sequence Ji,..., Jfc. Then Aig/ and v satisfies all conditions in the lemma, 
except possibly that the weights may be too small. 

Thus, let 0 be the set of parameters whose means and precisions are exactly those of O', but for which the 
weight of the £-th component is defined to be wi = max(e/( 2 fc), ■u;|) for all 1 < < fc —1 and Wk = 

It is easy to see that 0 G Qk] moreover, \\Mg — A^ 0 '||i < e. Then it is easy to see that 0 and v together 
satisfy all the conditions of the lemma. □ 

4.4 The full algorithm 

At this point, we are finally ready to describe our algorithm LearnGMM which agnostically and properly learns an arbitrary mixture of $k$ Gaussians. Informally, our algorithm proceeds as follows. First, using Estimate-Density, we learn a $p'_{\mathrm{dens}}$ that with high probability is $\epsilon$-close to the underlying distribution $f$ in $L_1$-distance. Then, as before, we may rescale the entire problem so that the density estimate is supported on $[-1, 1]$. Call the rescaled density estimate $p_{\mathrm{dens}}$.

As before, it suffices to find a $k$-GMM that is close to $p_{\mathrm{dens}}$ in $\mathcal{A}_K$-distance, for $K = 4k - 1$. The following is a direct analog of Lemma 12. We omit its proof because it is almost identical to that of Lemma 12.

Lemma 27. Let $\epsilon > 0$, $v \in A$, $k \geq 2$, $\theta^r \in \Theta_k$, and $K = 4(k - 1) + 1$. Then we have

$$0 \leq \|p_{\mathrm{dens}} - P_{\epsilon,\theta^r,v}\|_1 - \|p_{\mathrm{dens}} - P_{\epsilon,\theta^r,v}\|_{\mathcal{A}_K} \leq 8 \cdot \mathrm{OPT}_k + O(\epsilon) .$$

Our algorithm enumerates over all $v \in A$ and for each $v$ finds a $\theta^r$ approximately minimizing

$$\|p_{\mathrm{dens}} - P_{\epsilon,\theta^r,v}\|_{\mathcal{A}_K} .$$

Using the same binary search technique as before, we can transform this problem into $\log 1/\epsilon$ feasibility problems of the form

$$\|p_{\mathrm{dens}} - P_{\epsilon,\theta^r,v}\|_{\mathcal{A}_K} \leq \eta . \tag{9}$$
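The reduction from minimization to feasibility can be sketched as a doubling search over the threshold, repeated for every allocation. In the sketch below, `solve` stands in for a feasibility oracle such as the polynomial-system solver; it is a hypothetical callback that returns a solution or `None`, and the cap `max_threshold` is an assumption added only so the loop always terminates. The toy costs in the usage lines are made up for illustration.

```python
def doubling_search(solve, eps, max_threshold=4.0):
    """Return (solution, eta) for the first eta in eps, 2*eps, 4*eps, ... that is feasible.

    `solve(eta)` is a hypothetical feasibility oracle: it returns a solution
    when the constraint "distance <= eta" is satisfiable and None otherwise.
    """
    eta = eps
    while eta <= max_threshold:
        solution = solve(eta)
        if solution is not None:
            return solution, eta
        eta *= 2.0
    raise ValueError("no feasible threshold below max_threshold")


def best_over_allocations(allocations, make_solver, eps):
    """Run the doubling search for each allocation and keep the smallest threshold."""
    best = None
    for v in allocations:
        solution, eta = doubling_search(make_solver(v), eps)
        if best is None or eta < best[2]:
            best = (v, solution, eta)
    return best


# Toy usage: each allocation has a made-up optimal distance.
toy_cost = {"v1": 0.3, "v2": 0.05}


def make_toy_solver(v):
    return lambda eta: ("params-for-" + v) if eta >= toy_cost[v] else None


best = best_over_allocations(["v1", "v2"], make_toy_solver, eps=0.01)
```

Because the threshold doubles on every failure, the returned value is within a factor of two of the smallest feasible one, and the number of oracle calls per allocation is logarithmic in the ratio between the cap and $\epsilon$.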

Fix V G A, and recall is the set of all polynomials of the form „• Let denote the set of O'" G 0fc 
so that fii G ^], •\/27rw/(8s) < f < (/)/2, and Wi > e/{2k), for all i. For any O'" G 0™'“^, canonically 

identify it with P^gr- „• By almost exactly the same arguments used in Section 3.5, it follows that the class 
Vly, where 9 G 0™*“^, satisfies the conditions in Section 3.4, and that the system of polynomial equations 
'S'ic,pdenB.'Pr„(^) has two levels of quantification (each with 0 (k) bound variables), has k^^^^ polynomial 
constraints, and has maximum degree 0(log(l/e)). Thus, we have 

Corollary 28. For any fixed e, v, and for K = ik — 1, we have that Sj^ Qvatid(v) encodes Equation 

(9) ranging over 9 G 0 ™*®'^. Moreover, for all 7 , A > 0, Solve-Poly-Program(S'j^ 'pr QvaLid 
runs in time 

(fclog(l/e))0('=')loglog(3+^) . 

For each $v$, our algorithm then performs a binary search over $\eta$ to find the smallest (up to constant factors) $\eta$ so that Equation (9) is satisfiable for this $v$, and records both $\eta_v$, the smallest $\eta$ for which Equation (9) is satisfiable for this $v$, and the output of the system of polynomial inequalities for this choice of $\eta$. We then return $\theta_{v'}$ so that $\eta_{v'}$ is minimal over all $v \in A$. The pseudocode for LearnGMM is in Algorithm 2.
The following theorem is our main technical contribution:

Theorem 29. LearnGMM($k$, $\epsilon$, $\delta$) takes $O((k + \log 1/\delta)/\epsilon^2)$ samples from the unknown distribution with density $f$, runs in time



and with probability $1 - \delta$ returns a set of parameters $\theta \in \Theta_k$ so that $\|f - \mathcal{M}_\theta\|_1 \leq 58 \cdot \mathrm{OPT} + \epsilon$.

Proof. The sample complexity follows simply because Estimate-Density draws $O((k + \log 1/\delta)/\epsilon^2)$ samples, and these are the only samples we ever use. The running time bound follows because $|A| = k^{O(k)}$ and from Corollary 28. Thus it suffices to prove correctness.

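Let $\theta$ be the parameters returned by the algorithm. It was found in some iteration for some $v \in A$. Let $v^*, \theta^*$ be those which are guaranteed by Corollary 26. We have

$$\|p_{\mathrm{dens}} - P_{\epsilon,\theta^*,v^*}\|_{\mathcal{A}_K} \leq \|p_{\mathrm{dens}} - f\|_1 + \|f - \mathcal{M}_{\theta^*,v^*}\|_1 + \|\mathcal{M}_{\theta^*,v^*} - P_{\epsilon,\theta^*,v^*}\|_1 \leq 23 \cdot \mathrm{OPT}_k + O(\epsilon) .$$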


By the above inequalities, the system of polynomial equations is feasible for 7 < 46 • OPT^ -|- 0(e) in 
the iteration corresponding to v * (Corollary 26 guarantees that the parameters 9 * are sufficiently bounded). 
Hence, for some 77 ^* < 7 , the algorithm finds some 9 ' so that there is some 9 ” so that |b' — 0"||2 < Ci{e/ (fik))'^, 
which satisfies Sp^^^^ -pr- ^ 0 vaiid(r'„*). 

Let 9i be the set of parameters computed by the algorithm before rounding the weights back to the 
simplex (i.e. at Line 11). By our choice of precision in solving the polynomial program, (i.e. by our choice 
of A on Line 24 of Algorithm 2), we know that the precisions of the returned mixture are non-negative (so 
each component is a valid Gaussian). It was found in an iteration corresponding to some v G A, and there is 
some rjy < rjy* < 46 • OPTfc + 0(e) and some 0) satisfying the system of polynomial equalities for v and rjy, so 
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Algorithm 2 Algorithm for proper learning an arbitrary mixture of k Gaussians. 

1 : function LearnGMM(A;, e, i5) 

2 : > Density estimation. Only this step draws samples. 

3: Pdens ^ ESTIMATE-DENSITY(fc, £, 6) 

4: > Rescaling 

5: > Pdens is a rcscalcd and shifted version of such that the support of pdens is [— 1 , 1 ]. 

6 : Let pdens(a;) ='' - l) 

> Fitting shape-restricted polynomials 

for V & A do 

Pv,0l ^ FlNDFlTGlVENALLOCATION(pdens,'y) 

Let 0 so that 9^ = d^,, so that is minimal over all py (breaking ties arbitrarily). 

> Round weights back to be on the simplex 

for z = 1,..., fc — 1 do 

Wi Wi — e/2fc (This guarantees that ^ see analysis for details) 

If Wi > 1, set Wi = 1 

Wk ^ 1- YhZI Wi 

> Undo the scaling 

W'i ^ Wi 

^ + a 

return 6' 

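21: function FindFitGivenAllocation($p_{\mathrm{dens}}$, $v$)
22:     $\nu \leftarrow \epsilon$
23:     Let $C_1$ be a universal constant sufficiently small.
24:     Let $\lambda \leftarrow \min(C_1 (\epsilon/(\phi k))^2, 1/16s, \epsilon/(4k))$
25:     ▷ This choice of precision provides robustness as needed by Lemma 24, and also ensures that all the weights and precisions returned must be non-negative.
26:     Let $\psi \leftarrow 6ks\phi/\omega + 3k\phi/2 + 1$
27:     ▷ By Corollary 26, this is a bound on how large any solution of the polynomial program can be.
28:     $\theta^r \leftarrow$ Solve-Poly-System($S_{K, p_{\mathrm{dens}}, \mathcal{P}_{\epsilon,v}, \Theta_k^{\mathrm{valid}}}(\nu)$, $\lambda$, $\psi$)
29:     while $\theta^r$ is "NO-SOLUTION" do
30:         $\nu \leftarrow 2 \cdot \nu$
31:         $\theta^r \leftarrow$ Solve-Poly-System($S_{K, p_{\mathrm{dens}}, \mathcal{P}_{\epsilon,v}, \Theta_k^{\mathrm{valid}}}(\nu)$, $\lambda$, $\psi$)
32:     return $\theta^r$, $\nu$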

that \\9i — 9 i\\2 < Ci{e/{(pk))"^. Let 9 be the set of rescaled parameters obtained after rounding the weights 
of 6*1 back to the simplex. It is straightforward to check that 9 G 0fc, and moreover, \\A4g „ — Aig, „||i < 2e, 
and so IIPj;^ „ - < 0 (e). 

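We therefore have

$$\begin{aligned}
\|f - \mathcal{M}_\theta\|_1 &\leq \|f - p_{\mathrm{dens}}\|_1 + \|p_{\mathrm{dens}} - P_{\epsilon,\theta,v}\|_1 + \|P_{\epsilon,\theta,v} - \mathcal{M}_{\theta,v}\|_1 \\
&\overset{(a)}{\leq} 4 \cdot \mathrm{OPT} + \epsilon + 8 \cdot \mathrm{OPT} + O(\epsilon) + \|p_{\mathrm{dens}} - P_{\epsilon,\theta,v}\|_{\mathcal{A}_K} + \epsilon \\
&\overset{(b)}{\leq} 12 \cdot \mathrm{OPT} + O(\epsilon) + \|p_{\mathrm{dens}} - P_{\epsilon,\theta_1,v}\|_{\mathcal{A}_K} \\
&\overset{(c)}{\leq} 58 \cdot \mathrm{OPT} + O(\epsilon) ,
\end{aligned}$$

where (a) follows from Lemmas 27 and 25, (b) follows from the arguments above, and (c) follows since $\theta_1$ satisfies the system of polynomial inequalities for $\eta_v \leq 46 \cdot \mathrm{OPT}_k + O(\epsilon)$.

As a final step, we choose an internal $\epsilon'$ in our algorithm so that the $O(\epsilon')$ in the above guarantee becomes bounded by $\epsilon$. This proves the desired approximation guarantee and completes the proof. □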


5 Further classes of distributions 

In this section, we briefly show how to use our algorithm to properly learn other parametric classes of 
univariate distributions. 

Let $\mathcal{C}$ be a class of parametric distributions on the real line, parametrized by $\theta \in S$ for $S \subseteq \mathbb{R}^u$. For each $\theta$, let $F_\theta \in \mathcal{C}$ denote the pdf of the distribution parametrized by $\theta$ in $\mathcal{C}$. To apply our algorithm in this setting, it suffices to show the following:

1. (Simplicity of $\mathcal{C}$) For any $\theta_1$ and $\theta_2$, the function $F_{\theta_1} - F_{\theta_2}$ has at most $K$ zero crossings. In fact it also suffices if any two such functions have "essentially" $K$ zero crossings.

2. (Simplicity of $S$) $S$ is a semi-algebraic set.

3. (Representation as a piecewise polynomial) For each $\theta \in S$ and any $\epsilon > 0$, there is a piecewise polynomial $P_{\epsilon,\theta}$ so that $\|P_{\epsilon,\theta} - F_\theta\|_1 \leq \epsilon$. Moreover, the map $(x, \theta) \mapsto P_{\epsilon,\theta}(x)$ is jointly polynomial in $x$ and $\theta$ at any point so that $x$ is not at a breakpoint of $P_{\epsilon,\theta}$. Finally, the breakpoints of $P_{\epsilon,\theta}$ also depend polynomially on $\theta$.

4. (Robustness of the Parametrization) There is some robust parametrization so that we may assume that 

all “plausible candidate” parameters are < and moreover, if ||0i — 02 || < then 

11^01 - ^02 II < £• 

Assuming C satisfies these conditions, our techniques immediately apply. In this paper, we do not attempt 
to catalog classes of distributions which satisfy these properties. However, we believe such classes are 
often natural and interesting. We give evidence for this below, where we show that our framework produces 
proper and agnostic learning algorithms for mixtures of two more types of simple distributions. The resulting 
algorithms are both sample optimal (up to log factors) and have nearly-linear running time. 

5.1 Learning mixtures of simple distributions

As a brief demonstration of the generality of our technique, we show that our techniques give proper and 
agnostic learning algorithms for mixtures of k exponential distributions and Laplace distributions (in addition 
to mixtures of k Gaussians) which are nearly-sample optimal, and run in time which is nearly-linear in the 
number of samples drawn, for any constant k. 

We now sketch a proof of correctness for both classes mentioned above. In general, the robustness condition is arguably the most difficult to verify of the four conditions required. However, it can be verified that for mixtures of simple distributions with reasonable smoothness conditions the appropriate modification of the parametrization we developed in Section 4 will suffice. Thus, for the classes of distributions mentioned, it suffices to demonstrate that they satisfy conditions (1) to (3).

Condition 1: It follows from the work of [Tos06] that the difference of k exponential distributions or k 
Laplace distributions has at most 2k zero crossings. 

Condition 2: This holds trivially for the class of mixtures of exponential distributions. We need a bit of care to demonstrate this condition for Laplace distributions since a Laplace distribution with parameters $\mu, b$ has the form

$$\frac{1}{2b} \, e^{-|x - \mu|/b} ,$$

and thus the Taylor series is not a polynomial in $x$ or the parameters. However, we may sidestep this issue by simply introducing a variable $y$ in the polynomial program which is defined to be $y = |x - \mu|$.
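One standard way to express the auxiliary variable with polynomial constraints only, given here as an illustration rather than as the exact encoding used in the polynomial program, is

```latex
y = |x - \mu|
\quad\Longleftrightarrow\quad
y \geq 0 \;\;\text{and}\;\; y^2 = (x - \mu)^2 ,
```

so the absolute value is replaced by one polynomial equality and one polynomial inequality, and the density becomes a function of $y$ and $b$ whose truncated Taylor expansion is polynomial in $y$.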


Condition 3: It can easily be shown that a truncated degree $O(\log 1/\epsilon)$ Taylor expansion (as of the form we use for learning $k$-GMMs) suffices to approximate a single exponential or Laplace distribution, and hence an $O(k)$-piecewise degree $O(\log 1/\epsilon)$ polynomial suffices to approximate a mixture of $k$ exponential or Laplace distributions up to $L_1$-distance $\epsilon$.
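The degree claim can be checked numerically for the function $e^{-t}$, which underlies both the exponential and the Laplace density. The sketch below truncates the Taylor series on the window $[0, \log(1/\epsilon)]$, outside of which the exponential tail has mass at most $\epsilon$, and picks the degree from the Lagrange remainder bound. The window choice and the function names are assumptions made for illustration; the check is pointwise and is not the paper's exact construction.

```python
import math


def taylor_degree(radius, eps):
    """Smallest d with radius^(d+1) / (d+1)! <= eps.

    This is the Lagrange remainder bound for exp(-t) on 0 <= t <= radius.
    """
    d = 0
    term = radius  # equals radius^(d+1) / (d+1)! for d = 0
    while term > eps:
        d += 1
        term *= radius / (d + 1)
    return d


def truncated_exp_neg(t, degree):
    """Degree-`degree` Taylor polynomial of exp(-t) around 0."""
    total, term = 0.0, 1.0
    for i in range(degree + 1):
        total += term
        term *= -t / (i + 1)
    return total


eps = 1e-3
radius = math.log(1.0 / eps)
degree = taylor_degree(radius, eps)
grid = [radius * i / 1000.0 for i in range(1001)]
worst_error = max(abs(math.exp(-t) - truncated_exp_neg(t, degree)) for t in grid)
```

For this value of $\epsilon$ the required degree stays within a small constant multiple of $\log(1/\epsilon)$, in line with the $O(\log 1/\epsilon)$ degree stated above.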

Thus for both of these classes, the sample complexity of our algorithm is $O(k/\epsilon^2)$, and its running time


is 



+ 0 



similar to the algorithm for learning $k$-GMMs. As for $k$-GMMs, this sample complexity is nearly optimal, and the running time is nearly-linear in the number of samples drawn, if $k$ is constant.